The unprecedented success of large language models (LLMs) has changed the way people solve new problems in machine learning. Compared to conventional end-to-end training, where a neural network is trained from scratch on some curated dataset, it has become increasingly popular to leverage a pretrained LLM and design good prompts that contain in-context examples and effective instructions. These two ways of problem-solving lead to an intriguing comparison. Traditionally, we would optimize a neural network in a continuous numerical space using gradient descent, while in the new approach, we optimize the input prompt of an LLM in a discrete natural language space. Since a neural network is effectively a function parameterized by its numerical weights, can a pretrained LLM act as a function parameterized by its text prompt?
Driven by this question, we conceptualize the framework of verbalized machine learning (VML), which uses natural language as the representation of the model parameter space. The core idea behind VML is that we can define a machine learning model using natural language, and the training of such a model is based on the iterative update of natural language. This framework enables many new possibilities for interpretability, as the decision rules and patterns learned from data are stored and summarized in natural language. Specifically, we propose to view the input text prompt of LLMs as the model parameters that are being learned. However, optimization over such a natural language parameter space also introduces additional difficulties. Inspired by previous work [5, 25] where the optimizer is viewed as a function parameterized by a neural network, we parameterize the optimizer function as another LLM, which produces the next-step model parameters by taking in the current model parameters, a batch of training data points, and the loss function. Therefore, VML requires the optimizer LLM to update the learner LLM iteratively towards the training objective.
Compared to conventional numerical machine learning, the VML framework brings a few unique advantages. First, VML introduces an easy and unified way to encode inductive bias into the model. Because the model parameters are fully characterized by human-interpretable natural language, one can easily enter the inductive bias using language. This linguistic parameterization makes machine learning models fully interpretable and adjustable. For example, if the input and output data are observed to be linearly correlated, then one can use this sentence as part of the text prompt. How to effectively encode inductive bias is a longstanding problem in machine learning, and VML provides a unified way to inject the inductive bias through natural language, just like teaching a human learner. Second, VML performs automatic model selection during the learning process. The optimizer LLM can automatically select a suitable model class based on the training data and verbalized prior knowledge. Third, each update of the model is fully interpretable in the sense that the optimizer LLM can give an explanation of why it chooses such an update. One can even interact with the optimizer LLM in order to inject new prior knowledge or obtain detailed reasoning.
VML can be viewed as a natural generalization of in-context learning (ICL). Specifically, ICL is a single-step implicit learning process, while VML is a multi-step iterative learning process where the in-context examples are summarized into verbal patterns and knowledge. Moreover, VML offers a sequential (or conditional) way of scaling inference-time compute [7, 48]. Compared to best-of-N re-sampling, VML iteratively updates its model parameter prompt by taking into account the learner’s past predictions.
An important concept of VML is its unified token-level representation of both data and model. Unlike numerical machine learning, language models in VML do not differentiate data and model, and treat both of them as part of the text prompt. This shares a striking connection to stored-program computers, also known as the von Neumann architecture, where the key idea is to represent programs as data rather than wiring setups. The link between language models and stored-program computers underscores the importance of text prompts, which play a similar role to computer programs, and, along with LLMs, can become a powerful zero-shot problem solver. Our contributions are as follows:
• We formulate the framework of verbalized machine learning, where pretrained language models are viewed as function approximators parameterized by their text prompts. Then, we revisit a few simple machine learning problems and show that VML is able to solve them.
• We design a concrete VML algorithm with a text prompt template. This algorithm parameterizes both the learner model and the optimizer as LLMs, and enables the iterative verbalized training.
• We conduct empirical studies for the injection of verbalized inductive bias and show that it is promising to use natural language as a unified way to encode prior knowledge. Moreover, we validate the effectiveness of VML in different applications (Section 4, Appendices A, E, F, G, H).
LLMs for planning and optimization. Language models are used to perform planning for embodied agents [49, 60, 26, 28], such that they can follow natural language instruction to complete complex tasks. More recently, LLMs have been used to solve optimization problems [63]. Specifically, the LLM generates a new solution to an optimization problem from a prompt that contains previously generated solutions and their loss values. The LLM optimizer in [63] shares a high-level similarity to our work, as we also aim to solve an optimization problem with LLMs. The key difference to [63] is our function approximation view of LLMs, which enables us to revisit classical machine learning problems and solve them in the VML framework.
Natural language to facilitate learning. [45, 23, 24, 37, 71] show that natural language captions serve as an effective supervision to learn transferable visual representation. [35, 38, 40, 34, 66, 61] find that natural language descriptions can easily be turned into zero-shot classification criteria for images. [4] proposes to use natural language as latent parameters to characterize different tasks in few-shot learning. In contrast, VML uses the text prompt of LLMs to parameterize functions and learns this prompt in a data-driven fashion.
Prompt engineering and optimization. There are many prompting methods [56, 72, 73, 54, 67, 68, 58] designed to elicit the reasoning ability of LLMs. To reduce the effort in designing good prompts, prompt optimization [72, 73, 63, 41, 57, 11, 27, 32, 50, 70] has been proposed. VML can be viewed as a special instance of prompt optimization, but unlike many current generic prompt optimization methods that search for the optimal text prompts through best-of-N sampling without reasoning, VML updates its text-based parameters by explicitly reasoning about the incorrect predictions and learning the underlying data pattern based on the reasoning outcome, which ensures that the learner in VML remains fully interpretable. To summarize, the difference between VML and current generic prompt optimization (e.g., [73]) is similar to the difference between gradient-based and gradient-free optimization. Another subtle difference that separates the two is that the goal of VML (like classical machine learning) is to learn a model to recognize a generalizable pattern in a given training set, while the goal of current prompt optimization (like classical optimization) is more general, which is to simply optimize an objective function (without the need to learn the underlying pattern, e.g., [41]). We can phrase the goal of a machine learning problem as an optimization objective and use the tools from optimization to solve it, but the focuses of the two areas are fundamentally different. We compare the two in the experiments (see Section 4.8 and Appendix D). We observe that current prompt optimization often results in a generic instruction rather than a description of the data pattern.
LLMs for multi-agent systems. Due to their strong instruction-following ability, LLMs are capable of playing different roles in a multi-agent system. [42, 59, 15, 22] study multi-agent collaboration systems for solving complex tasks like software development. VML can also be viewed as a two-agent system where one LLM plays the learner role and the other LLM plays the optimizer role.
3.1 From Numerical to Verbalized Machine Learning
Figure 1: A comparison between numerical machine learning and VML.
Classical machine learning models (e.g., neural networks) are typically trained within a continuous numerical parameter space. Once trained, these models are stored as a collection of numerical values that are not interpretable and remain a black box. Motivated by the strong universal problem-solving capability of LLMs, we find it appealing to view an LLM as a function approximator that is parameterized by its own text prompt. This perspective leads to our VML framework. Similar to a general-purpose modern computer whose functionality is defined by its running program, a function that is defined by an LLM is characterized by its text prompt. Because the text prompt is fully human-interpretable, the VML framework provides strong interpretability for its learned function and makes it easy to trace the cause of model failure. Figure 1 gives a comparison between numerical machine learning and VML. In the proposed VML framework, both data and model are represented in a unified token-based format, while numerical machine learning treats data and model differently.
3.2 Natural Language as the Model Parameter Space
VML parameterizes a machine learning model with natural language. More formally, VML places a strong constraint on the model parameters in exchange for interpretability: $\theta = \{\theta_1, \theta_2, \cdots, \theta_t, \cdots\} \in \Theta_{\text{language}}$, where $\theta$ is a text token sequence, $\theta_t \in \mathcal{A}$ is some text token from a large token set $\mathcal{A}$, and $\Theta_{\text{language}}$ denotes the set of all natural language sequences that humans can understand. The model parameter space in VML has the following properties: (1) discrete: the natural language space is discrete; (2) sequential: the natural language space is sequential, and the next word is dependent on its previous words. In contrast, the parameter space in numerical machine learning is not sequentially dependent; and (3) human-interpretable: the natural language that characterizes the model is human-interpretable. More discussions are given in Appendix J.
Figure 2: An overview of iterative optimization and text prompt templates of the learner and the optimizer in the regression example.
One of the most significant advantages of using natural language as the model parameters is the easy incorporation of our prior knowledge about the problem and the desired inductive bias into the model training. When the model parameters get updated during training, the model is fully interpretable, and one can observe and understand what gets added and what gets modified. Our empirical evidence also supports our interpretability claim, as we find that the model parameters are typically a language description of the underlying pattern that the model discovers from the training data.
3.3 Language Models as Function Approximators
The core idea behind VML is to use a pretrained language model as a function approximator that is parameterized by its natural language prompt. Specifically, we denote the language model as $f_{\text{model}}(x; \theta)$, where $x$ is the input data and $\theta$ is the function parameter. $x$ can be represented as text tokens (or in other formats such as images if the LLM supports vision input), and the model parameter $\theta$ is also represented with text tokens. In VML, $f_{\text{model}}(\cdot)$ is typically a frozen large language model that is pretrained on a large corpus of text (e.g., DeepSeek-V3 [29], Llama-3 [51], GPT-4 [1]). If we consider a static function, we can set the temperature parameter of the LLM to zero, which theoretically makes the output deterministic. If we set the temperature high (see Appendix I for more discussion), $f_{\text{model}}(x; \theta)$ can be viewed as sampling from some distribution. We revisit how a classical machine learning problem can be formulated in the VML framework. Suppose we have in total $N$ training data points $\{x_n, y_n\}_{n=1}^{N}$, where $x_n$ is the input feature vector and $y_n$ is the target output value. As an illustrative example, we consider the following least squares regression problem using the function $f_{\text{model}}(x; \theta)$ that is parameterized by the natural language description $\theta$:
$$\min_{\theta} \; \sum_{n=1}^{N} \big( y_n - f_{\text{model}}(x_n; \theta) \big)^2 \quad \text{s.t.} \;\; \theta \in \Theta_{\text{language}}, \tag{1}$$
where $\Theta_{\text{language}}$ is the natural language parameter space, and minimizing the objective function with respect to the discrete token-based model parameters is quite difficult. Back-propagating gradients through discrete variables (e.g., policy gradients, Gumbel-softmax [16]) is typically known to be sample-inefficient and sub-optimal.
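To make the function-approximation view concrete, the following is a minimal sketch, not the authors' implementation. It assumes the LLM is any callable from a prompt string to a completion string; `make_learner`, `squared_error_loss`, the prompt layout, and the deterministic stand-in `toy_llm` are all hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, Tuple

# Assumption: an "LLM" is any callable that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def make_learner(llm: LLM, theta: str) -> Callable[[float], float]:
    """Return a function of x whose behaviour is determined by the text parameter theta."""
    def f_model(x: float) -> float:
        # The prompt layout below is an assumption made for this sketch.
        prompt = f"{theta}\nInput: {x}\nOutput:"
        return float(llm(prompt).strip())
    return f_model


def squared_error_loss(f_model: Callable[[float], float],
                       data: Iterable[Tuple[float, float]]) -> float:
    """Sum of squared residuals over (x, y) pairs, i.e. the least squares objective."""
    return sum((y - f_model(x)) ** 2 for x, y in data)


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a frozen LLM at temperature zero.

    It only 'follows' the rule y = 3x + 4 when the prompt states that rule.
    """
    x = float(prompt.rsplit("Input:", 1)[1].split("\n")[0])
    return str(3 * x + 4) if "y = 3x + 4" in prompt else "0.0"
```

With this stand-in, changing only the text parameter changes the function: a learner built from "You are a regression model." always outputs 0, while one whose text also states "The pattern is y = 3x + 4." reproduces the line exactly.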
3.4 Iterative Training by Prompt Optimization
Because the model parameters $\theta$ in VML are text prompts, optimizing $\theta$ is effectively a prompt optimization problem. Different from current prompt optimization [73], where the goal is to produce a generic prompt without adding new information, the training in VML focuses on updating the model’s language characterization, which involves both the addition of new prior information and the modification of existing information. To optimize the model parameters, we start with a gradient descent step on the regression objective in Equation 1:
$$\theta_i = \theta_{i-1} - \eta \cdot \frac{\partial}{\partial \theta_{i-1}} \sum_{n=1}^{N} \big( y_n - f_{\text{model}}(x_n; \theta_{i-1}) \big)^2 \quad \text{s.t.} \;\; \theta_i \in \Theta_{\text{language}},$$
where $\eta$ is the learning rate, and the constraint ensures that the updated model parameters are still in the natural language space. Computing this gradient appears to be infeasible. To address this, we view the gradient as a function of the data $(x, y)$ and the current model parameters $\theta_{i-1}$. Then we directly approximate the next-step model parameters using another pretrained language model denoted by $f_{\text{opt}}$:
$$\theta_i = f_{\text{opt}}\big( \{x_n, \hat{y}_n, y_n\}_{n=1}^{M}, \theta_{i-1}; \psi \big),$$
where $\hat{y}_n$ is the model prediction from the learner $f_{\text{model}}(x_n; \theta_{i-1})$, $M$ is the batch size, and $\psi$ denotes the optimizer parameters that characterize the optimizer settings; we can use language to specify the update speed, the momentum, etc. The largest possible batch size of the optimizer LLM is determined by its context window. The optimizer LLM can already output natural language that satisfies the constraint, so we simply ask the LLM to play the optimizer role, which has been shown to be effective in [63]. More importantly, our VML framework gets better as the instruction-following ability of LLMs gets stronger. An overview of the iterative optimization and the prompt templates in the regression example is given in Figure 2. The training procedure is given in Algorithm 1.
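The iterative procedure described above can be sketched as a plain loop. This is an illustrative sketch under stated assumptions rather than the paper's Algorithm 1: `train_vml` is a hypothetical name, and `toy_learner` and `toy_optimizer` are deterministic stand-ins for the learner and optimizer LLMs so that the loop runs without any model. The stand-in optimizer ignores the predictions, whereas an actual optimizer LLM would reason over them.

```python
import re
from typing import Callable, List, Tuple

Batch = List[Tuple[float, float]]


def train_vml(learner: Callable[[float, str], float],
              optimizer: Callable[[Batch, List[float], str], str],
              theta0: str, data: Batch,
              batch_size: int = 10, epochs: int = 1) -> List[str]:
    """Iteratively update the text parameters; returns [theta_0, theta_1, ...]."""
    history = [theta0]
    theta = theta0
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Forward pass: one learner query per data point in the batch.
            preds = [learner(x, theta) for x, _ in batch]
            # Update: one optimizer query that sees inputs, targets, predictions and theta.
            theta = optimizer(batch, preds, theta)
            history.append(theta)
    return history


def toy_learner(x: float, theta: str) -> float:
    """Stand-in learner: applies 'y = a * x + b' if theta states it, otherwise outputs 0."""
    m = re.search(r"y = (-?\d+\.?\d*) \* x \+ (-?\d+\.?\d*)", theta)
    return float(m.group(1)) * x + float(m.group(2)) if m else 0.0


def toy_optimizer(batch: Batch, preds: List[float], theta: str) -> str:
    """Stand-in optimizer: rewrites theta with a least squares line fitted to the batch."""
    n = len(batch)
    mx = sum(x for x, _ in batch) / n
    my = sum(y for _, y in batch) / n
    sxx = sum((x - mx) ** 2 for x, _ in batch)
    a = sum((x - mx) * (y - my) for x, y in batch) / sxx
    b = my - a * mx
    return f"The task is 1-D regression. Pattern: y = {a:.2f} * x + {b:.2f}"
```

The loop makes the cost structure visible: each step uses as many learner queries as there are points in the batch, plus a single optimizer query.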
Using an LLM as the optimizer offers several unique advantages. First, the optimizer can perform automatic model selection. When the learner model cannot make correct predictions for the training data, the optimizer will automatically update the learner to a more complex and capable model (see the polynomial regression experiments in Section 4.2 as an example). Second, the optimizer can provide detailed explanations of why a particular update should be performed, which helps us to understand the inner working mechanism of the verbalized optimization process. Third, the LLM-parameterized optimizer allows users to interact with it. This not only helps us to easily trace model failures, but more importantly, it also allows us to inject prior knowledge to improve optimization.
Different optimizer parameterizations. In this paper, we use a direct parameterization, i.e., parameterizing the optimizer as a single function $f_{\text{opt}}$, which couples the gradient and the update functions together. Alternatively, we can use an indirect parameterization where the gradient and the update are two separate LLM-parameterized functions. The gradients are known as “textual gradients” in prompt optimization [41, 70]. The update of the learner’s model parameters is then produced by an update function that takes a textual gradient as input, and the textual gradient is in turn computed by a gradient function. Both the gradient function and the update function are parameterized by LLMs. Compared to the direct parameterization, which takes only one LLM call, this process has to use several LLM calls. After empirically comparing both methods in Section 4.7 and Appendix B.3, we find that in most scenarios, the direct parameterization yields better performance.
3.5 Discussions and Insights
VML as a framework to encode inductive bias. A unified framework to encode arbitrary inductive bias has been pursued for decades. For different types of data, we need to design different models to encode the inductive bias (e.g., graphical models [18] for random variables, recurrent networks [14] for sequences, graph networks [17] for graphs, and convolutional networks [21] for images). VML uses a unified natural language portal to take in inductive biases, making it very flexible for encoding complex inductive bias. To incorporate an inductive bias about the hypothesis class or prior knowledge about the problem, we can simply concatenate a system prompt (i.e., some constant prefixed text that describes the inductive bias) with the model parameters $\theta$. The final model parameters are $(\theta_{\text{prior}}, \theta)$, where $\theta$ is learnable and $\theta_{\text{prior}}$ is given by users.
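As a small illustration of this concatenation, the sketch below keeps the user-given prior fixed and replaces only the learnable part after each optimizer step. The function names and the newline separator are assumptions made for this example, not details taken from the paper's prompt template.

```python
def full_parameters(theta_prior: str, theta: str) -> str:
    """Concatenate the fixed, user-given prior with the learnable text parameters."""
    # Assumption: prior and learnable text are joined by a newline.
    return f"{theta_prior}\n{theta}".strip()


def apply_update(theta_prior: str, new_theta: str) -> str:
    """Only the learnable part is replaced by the optimizer output; the prior is kept verbatim."""
    return full_parameters(theta_prior, new_theta)
```

For instance, a prior such as "The data looks like samples generated from a periodic function." stays at the front of the prompt throughout training, while the text after it changes from step to step.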
Figure 3: Training dynamics for VML based linear regression. The model is trained for 2 epochs, each with 10 steps.
VML enables interpretable knowledge discovery. Because the model parameters are already in natural language, it is easy to understand the underlying pattern that leads to the prediction and the decision rules that the model uses. Unlike numerical machine learning where the knowledge is learned within a black box, this property enables VML to discover novel knowledge that humans can also learn from.
VML as “the von Neumann architecture” in machine learning. Machine learning usually treats the model parameters and the data differently, similar to the Harvard architecture that stores instruction and data separately. VML stores both data and model parameters in the text prompt as tokens, which resembles the von Neumann architecture that stores instruction and data in the same memory.
We demonstrate the features and advantages of VML by revisiting some classical machine learning tasks followed by a realistic medical image classification task. In these tasks, we are given data $\mathcal{D}_{\text{train}}$ and we want to find $\theta^*$ such that $f_{\text{model}}(x; \theta^*)$ best describes the mapping $x \to y$. Our experiments below show in detail how VML is able to solve these tasks and find $\theta^*$.
Experiment setups. We use the instruction-tuned Llama-3 70B [51] as the LLM unless specified otherwise. The training set for each task consists of 100 data points. For all tasks, we use a batch size of 10 for each optimization step (see Figure 2 (right) as an example), which corresponds to 10 steps per training epoch. To evaluate regression performance, we look at the training loss and the model predictions in both interpolation and extrapolation settings. For classification, we use additional test sets (20 data points) and evaluate both training and test accuracies. Inspired by the momentum from classical optimization, we provide the last step (i.e., one step only) of the optimization history to the optimizer LLM for training stability.
Figure 4: Training dynamics for VML based polynomial regression. The model is trained for 2 epochs, each with 10 steps.
Training logs. The results of our experiments are shown using: (a) the training loss, which is computed by parsing the model output (a string) and converting it into the same data type as the target value ($y$); we then use the mean squared error for regression and the mean zero-one loss (reported as average accuracy) for classification. The computed training loss is for logging purposes only and is not required for training in VML (see Algorithm 1); (b) a visualization of the learned model, which is also obtained by parsing and converting the model output; (c) the model parameter at each training step $i$ before optimization (i.e., $\theta_{i-1}$), and the optimizer output for the updated $\theta_i$. For $i > 1$, the full model parameter before optimization consists of the initial $\theta_0$ together with $\theta_{i-1}$, but in our figures below we only show $\theta_{i-1}$ to save space.
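The logging step in (a) can be sketched as follows. This is a minimal example, assuming that the numeric prediction is the first number in the model's response and that class labels are integers; the function names and the regular expression are illustrative choices, not the parsing used in the experiments.

```python
import re
from typing import List, Optional


def parse_number(text: str) -> Optional[float]:
    """Extract the first number in a model response, or None if there is none."""
    m = re.search(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?", text)
    return float(m.group()) if m else None


def mean_squared_error(outputs: List[str], targets: List[float]) -> float:
    """Regression logging: parse each response, then average the squared errors."""
    preds = [parse_number(o) for o in outputs]
    if any(p is None for p in preds):
        raise ValueError("could not parse a numeric prediction")
    return sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)


def accuracy(outputs: List[str], targets: List[int]) -> float:
    """Classification logging: fraction of correct labels (mean zero-one loss is 1 - accuracy).

    Responses without a parsable label are counted as errors.
    """
    correct = 0
    for o, y in zip(outputs, targets):
        p = parse_number(o)
        correct += int(p is not None and int(p) == y)
    return correct / len(targets)
```

Because these quantities are computed outside the LLM calls, they can be logged at every step without affecting the optimization itself.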
Compute. The LLM is run on a node of 8 A100 GPUs using the inference engine provided by vLLM [20]. During each step $i$ of training, we query the LLM 10 times to evaluate the model $f_{\text{model}}(x; \theta_{i-1})$ over a batch, and 1 time to request the newly optimized $\theta_i$. We also evaluate the entire test set at each step, which, depending on the size of the evaluation set, requires between 20 and 100 LLM queries. Overall, the regression tasks take around 10 minutes per epoch of training, and the classification tasks take around 16 minutes per epoch. The visualization of the decision boundary takes around 6 minutes.
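The query counts above follow from simple arithmetic, which the sketch below spells out using the numbers stated in this section (100 training points, batch size 10, evaluation sets of 20 to 100 points). The function names are hypothetical.

```python
import math


def steps_per_epoch(train_size: int, batch_size: int) -> int:
    """Number of optimization steps needed to see the whole training set once."""
    return math.ceil(train_size / batch_size)


def queries_per_step(batch_size: int, eval_size: int) -> int:
    """Learner queries for the batch, one optimizer query, and one learner query per evaluation point."""
    return batch_size + 1 + eval_size


def queries_per_epoch(train_size: int, batch_size: int, eval_size: int) -> int:
    """Total LLM queries issued in one training epoch."""
    return steps_per_epoch(train_size, batch_size) * queries_per_step(batch_size, eval_size)
```

With a batch size of 10 and 100 training points, an epoch has 10 steps, so the per-epoch total ranges from 310 queries (20 evaluation points) to 1110 queries (100 evaluation points).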
4.1 Linear Regression
We generate $\mathcal{D}_{\text{train}}$ from a linear function with Gaussian noise, i.e., $y = 3x + 4 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$ and $x \sim \mathcal{U}(0, 2)$. We initialize the model parameter $\theta_0$ by only specifying that the task is a regression task from $\mathbb{R}$ to $\mathbb{R}$ (see Figure 3(c) Step 1). Figure 3(a) shows that training improves the model and that it converges.
Figure 5: Demonstration of prior injection, and comparison of Llama-3, GPT-4o and a neural net in the sinusoidal regression setting.
The subplots (b) and (c) show details of the model and optimization at steps 1, 3 and 15. At step 1, since $\theta_0$ only contains the definition of the 1-D regression task, the model $f_{\text{model}}(x; \theta_0)$ is randomly guessing (see the dash-dot line). The optimizer $f_{\text{opt}}$ says that it notices a linear relationship between the input and the target outputs, hence it introduces a linear regression model to capture such a relationship, which results in the updated model $f_{\text{model}}(x; \theta_1)$ being a straight line. From step 2 onward, the optimization focus switches to fitting the identified linear regression model to the data. For example, at step 3, the optimizer says that it notices that the outputs of the current model are generally smaller than the targets, suggesting that the scaling factor is too small; hence, it increases it. Similarly, at step 15, the optimizer says that it notices that the current model overestimates the targets; hence, it reduces the scaling factor. We can see from (b) that the resulting model closely approximates the ground truth.
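For readers who want to reproduce a comparable setup, the sketch below generates synthetic data for the linear task and splits it into batches of 10. It is only an illustration: the standard Gaussian noise, the input range [0, 2], the seed, and the function names are assumptions of this example, and the standard library `random` module stands in for whatever sampler was actually used.

```python
import random
from typing import List, Tuple


def make_linear_data(n: int = 100, seed: int = 0) -> List[Tuple[float, float]]:
    """Sample (x, y) pairs from y = 3x + 4 + eps with standard Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 2.0)   # assumed input range
        eps = rng.gauss(0.0, 1.0)   # assumed unit-variance Gaussian noise
        data.append((x, 3.0 * x + 4.0 + eps))
    return data


def batches(data: List[Tuple[float, float]], batch_size: int = 10) -> List[List[Tuple[float, float]]]:
    """Split the data into consecutive batches, one per optimization step."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
```

With 100 points and a batch size of 10, this yields the 10 steps per epoch used throughout the experiments.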
4.2 Polynomial Regression
We generate $\mathcal{D}_{\text{train}}$ from a polynomial function with Gaussian noise. Similarly, $\theta_0$ is initialized by only specifying that the task is a regression task from $\mathbb{R}$ to $\mathbb{R}$ (see Figure 4(c) Step 1). Figure 4(a) shows that training is effective and converges. Subplots (b) and (c) show details of the model and optimization at steps 1, 2 and 3. At step 1, the model $f_{\text{model}}(x; \theta_0)$ randomly guesses the outputs. The optimizer says that it notices $y$ has a larger range than $x$, and that they seem to be positively correlated; therefore, it updates the model to be a simple linear model. This linear model assumption leads to a jump in the training loss (see subplot (a)), as it is far from the ground truth. Subsequently, at step 2, the optimizer says that the poor performance makes it realize that the linear model oversimplifies the relationship between $x$ and $y$. It notices a non-linearity between $x$ and $y$, and to capture this, it uses a quadratic model. This leads to a better model and a large decrease in the training loss. At step 3, the optimizer switches from model class selection to fitting the quadratic model. The resulting model closely fits the ground truth.
4.3 Sinusoidal Regression
We generate $\mathcal{D}_{\text{train}}$ from a sine function with Gaussian noise, i.e., $y = \sin(x) + 2 + 0.01\epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$ and $x \sim \mathcal{U}(-3, 3)$. Fitting a sine function is known to be difficult for neural nets in terms of extrapolation. Here, we try GPT-4o, a more powerful model than Llama-3. Figure 5(b; right) shows that when $\theta_0$ contains only the definition of 1-D regression, training results in a linear model (see (c; right)). We can add a prior to $\theta_0$ by simply saying that the data looks like samples generated from a periodic function, which results in a very good approximation that extrapolates much better than a neural net (see (b,c; left)). We also find that adding the same prior to Llama-3 is not as effective (see (b,c; mid)), indicating that the capability of VML is highly dependent on the capability of the LLM. However, this also suggests that VML benefits from the scaling law of LLMs: the effectiveness of VML can improve along with the capability (size) of the LLM.
Figure 6: Linearly separable two blobs classification based on VML. (b) plots the decision boundary of the model.
4.4 Two Blobs Classification
We generate a linearly separable $\mathcal{D}_{\text{train}}$ from two blobs on a 2-D plane. $\theta_0$ is initialized by only specifying that the task is binary classification on a 2-D plane (see Figure 6(c) Step 1). Subplot (a) shows that training is effective and that it converges. At step 1, the optimizer says that the current batch of data has the pattern that data points with $x > 0$ belong to class 2, and data points with $x < 0$ belong to class 1; hence, it updates the model to have a linear decision boundary at $x = 0$, which happens to be perfect. However, Figure 6(a) shows that the training loss does not immediately converge. We can investigate the cause and “debug” the optimizer by looking at the optimizer output. From (c) Step 2, the optimizer says that the current model is already quite simple yet accurate, but it wants to further improve the model and utilize the new information from the current batch. Guided by this reasoning, the final model becomes a very deep decision tree, and the decision boundary has a reasonable margin towards the data (see Figure 6(b, c; right)). The results also reveal that VML’s interpretable learning may lead to complex patterns for easy problems, highlighting the importance of specifying a proper inductive bias.
4.5 Two Circles Classification
We generate a non-linearly separable $\mathcal{D}_{\text{train}}$ by creating data points on two concentric circles as the two classes. Besides the description of binary classification on a 2-D plane, we add a sentence to $\theta_0$ that encodes our inductive bias that the decision boundary is a circle (see Figure 7(c) Step 1). At step 1, the optimizer utilizes the prior and updates the model to have a circular decision boundary. For the rest of the training, the optimizer mainly tries to find a good fit for the radius and the center of the decision boundary. At step 41, the optimizer says that the current model seems to be a good fit for the data and that no changes are needed. Hence, it keeps the same parameters for the next model. Without the prior, VML can also learn a good model, but the performance shows large variance at the beginning of training (see Figure 7(a; dashed)) due to the model selection process similar to Figure 3(a). Figure 7(c; bottom right) shows the resulting model parameters without the prior, which describe a decision tree.
Figure 7: Non-linearly separable 2-circle classification with a prior in $\theta_0$. (a; dashed) and (c; bottom right) show results without the prior.
Figure 8: Tiny-PneumoniaMNIST image classification for models with and without prior at initialization.
4.6 Medical Image Classification
To demonstrate the capability of VML beyond simple machine learning problems, we evaluate the effectiveness of VML for image classification. We use GPT-4o, which supports visual inputs, to take into account both image and text data. The task is to classify whether an X-ray image has indications of pneumonia or not (see Figure 8(b) for image examples). Due to the cost of GPT-4o, we create a subset of the dataset PneumoniaMNIST [64]. Our dataset consists of 100 training data points and 100 test data points (half pneumonia and half normal for both sets). Models are trained for 5 epochs. We try out two different model parameter initializations, one with a prior and one without. We encode the inductive bias by simply adding a sentence as the prior, which states that the input is an X-ray image for identifying pneumonia, along with the definition of binary image classification (see Figure 8(c)). The test accuracy in (a) shows that both models are able to improve their performance on the task as the number of training epochs increases, and the model initialized with the prior outperforms the model without it (in terms of both test accuracy and training convergence). Additionally, by inspecting the parameters of the models (see (d)), we observe that the model parameters $\theta$ for the learner with the prior contain more medical domain knowledge associated with features of pneumonia (such as “acute infection”, “pneumatocele formation”), while the model parameters $\theta$ for the learner without any prior mainly use generic visual knowledge associated with features of the lung (such as “visible opacities”, “uniform texture”). This observation validates the effectiveness of using natural language to encode inductive bias. Our experiment also demonstrates the usefulness of learning in VML (i.e., the generalization performance can be improved over time), which is distinct from existing prompt engineering methods. Additionally, the interpretable nature of VML’s model parameters is crucial for applications in the medical domain. The learned models can be validated by medical professionals, and their predictions are grounded by their verbalized reasoning.
4.7 Ablation Study and Exploratory Experiments
Quantitative comparison to in-context learning. Since VML can be viewed as a generalization of ICL, we compare VML to ICL in all previous applications. Results are given in Table 1. The ICL results are the best across 5 runs. The metrics used for regression (Reg) and classification (Cls) are mean squared error (MSE ↓) and test accuracy (↑), respectively. We abbreviate linear regression as Reg-L, polynomial regression as Reg-P, two blobs classification as Cls-TB, two circles classification as Cls-TC and medical image classification as Cls-MI. We can observe that VML consistently outperforms ICL.
Figure 9: Training loss dynamics. For each configuration, we show 5 individual runs (thin) and their mean (thick).
Scaling effect with stronger LLMs. We are interested in whether the performance of VML can be improved by using a stronger LLM. To evaluate how VML scales with stronger LLMs, we use Llama-3.1 of different sizes (8B, 70B, 405B) as the backbone LLM for VML. From Figure 9(a), we can see that stronger LLMs (e.g., 405B) can indeed enable VML to learn faster and achieve lower loss in the linear regression setting. This result shows the great potential of VML as LLMs get better.
Direct vs. indirect parameterization. As discussed in Section 3.4, we can use either a direct or an indirect way to parameterize the optimizer. We compare both parameterizations using the linear regression setting (as described in Section 4.1). Figure 9(b) shows that the direct parameterization outperforms the indirect one. The direct parameterization leads to faster convergence while requiring fewer LLM calls. Detailed experimental settings and more discussions are given in Appendix B.3.
4.8 Comparison Between Generic Prompt Optimization and VML
To better differentiate VML from prompt optimization, we compare VML to a generic prompt optimization method called Automatic Prompt Engineer (APE) [73], both conceptually and qualitatively on two tasks. We use Llama-3-70B for both APE and VML. Even though both methods aim to optimize a prompt towards a pre-defined objective function, there are fundamental differences between the two. Conceptually, APE first samples a set of candidate prompts directly given only the training data, and then chooses the one with the best objective value. The prompt generation process is a best-of-N sampling, and each sample is independent of the others. In contrast, VML produces a new prompt by asking an LLM to explicitly reflect on the old prompt and the corresponding predictions ˆy, and then to reason about how to produce a prompt that can better predict the target y for the given input. This is conceptually similar to the distinction between gradient-free and gradient-based optimization. For more discussion and the implementation details of APE, please refer to Appendix D. Note that the qualitative comparisons in this section do not imply superiority of one method over the other.
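The conceptual contrast can be summarized in a schematic sketch. It is not an implementation of APE or VML: `propose`, `score`, and `refine` are hypothetical callables standing in for LLM calls and the objective, and the two functions only show how the search is organized in each case.

```python
from typing import Callable, List


def best_of_n(propose: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Best-of-N selection: draw n candidate prompts independently and keep the lowest-loss one."""
    candidates: List[str] = [propose() for _ in range(n)]
    return min(candidates, key=score)


def iterative_refinement(refine: Callable[[str], str], theta0: str, steps: int) -> str:
    """Sequential updates: each new prompt is produced from the previous one.

    In a VML-style loop, `refine` would be the optimizer call that reflects on the
    previous prompt and the learner's predictions.
    """
    theta = theta0
    for _ in range(steps):
        theta = refine(theta)
    return theta
```

In the first function no candidate depends on any other, whereas in the second every prompt is conditioned on its predecessor, which is what allows information from earlier mistakes to accumulate.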
Figure 10: VML versus a prompt optimization method (Automatic Prompt Engineer [73]).
Linear regression as in Section 4.1. Figure 10(a) shows that the result from APE is vague and general. Such a description can easily be derived by humans through visual inspection of the data, and it does not learn deeper insights from the data, whereas VML is able to learn useful new information that is difficult to obtain by visual inspection. We can observe that VML automatically performs effective pattern summarization from data, which differs from naive prompt optimization.
Text classification. Adopted from the Google BIG-bench [6], the task is to classify whether a name is more likely to be associated with female or male. Figure 10(b) shows that APE does return a correct description of the task, but it is, once again, very general. Conversely, VML is able to learn more detailed knowledge about the data pattern, which cannot be done easily through visual inspection.
Figure 11: VML is able to learn to reason and solve symbolically generated GSM8K [36] questions with Llama-3.1-8B.
4.9 VML Enables Robust Mathematical Reasoning
Recent work [36] shows that if we modify the original GSM8K [10] question by changing only the variable values (e.g., Figure 11(a)), the accuracy of many LLMs on the modified dataset declines, which might be due to data contamination during pretraining. Our experiments show that VML can reduce such performance variation and enable robust mathematical reasoning without changing the internal weights of LLMs. Specifically, we randomly generate a training set and a test set, both of size 100 (without overlap), using the template in Figure 11(a). If we directly evaluate the test set using Llama-3.1-8B, the average accuracy over 5 runs is around 80%. We use VML to learn a set of instructions for this task; the initial instructions are given in Figure 11(c). We use a batch size of 10 and train for 10 steps. Figure 11(b) shows that, on average over 5 runs, the test performance increases with the number of training steps, and VML enables the model to achieve 98% accuracy on the test set. Figure 11(d) shows the optimization outputs for steps 1, 3, and 6 of a selected run in Figure 11(b). We can see that after step 1, the new instructions already recover the correct mathematical reasoning for the task, but the test accuracy is only around 91%. At step 3, the optimizer realizes that the error is mostly due to inaccurate calculations rather than the correctness of the instructions for the model. Hence, the new instructions for the model include an emphasis on calculation verification, which brings the test performance up to 96%. At step 6, the optimizer says it notices that many of the model’s mistakes are still related to incorrect and incomplete calculations; therefore, it breaks down the instructions into a more detailed list to allow easier reasoning and checking. Therefore, VML can enable LLMs to improve their reasoning ability at test-time by themselves without changing the internal weights.
Figure 12: Some failure cases where the optimization was trapped in a bad local minimum in the task of polynomial regression.
4.10 Failure Cases
Figure 12 shows a failure run for the polynomial regression task. Our log for this run records that after the first optimization step, the model is updated to a linear regression model, and the rest of the optimization steps simply try to fit this linear model to the data. Therefore, the training dynamics plot shows a fluctuating line, and the step with the lowest training loss (i.e., Step 19) still has a linear regression model. Unlike the other successful runs, where the optimizer realizes that a quadratic function can be a better model class than a linear function, in this failure case the optimization is clearly trapped in a local minimum of a linear model. One possible way to reduce such failure cases is to use more powerful LLMs. See Appendix C for more detailed discussions and three other examples where VML failed to learn a desirable model. Some of these failures can be avoided by changing the VML prompt template, while others could occur less frequently if we switch to a more powerful LLM.
Our paper introduces a verbalized way to perform machine learning and conducts several case studies on regression and classification tasks. The experiments show that VML can effectively perform these classical machine learning tasks with low-dimensional input data, validating the potential of LLMs as function approximators. Despite the empirical effectiveness, there are still limitations that remain to be addressed.
Large variance in learning. Training in VML still suffers from a relatively large variance. This is partially due to the stochasticity from the LLM inference, the capability of the LLM, as well as the prompt design of the optimizer. See Appendix C for failure cases and analyses.
Scalability in terms of data dimension, model parameter, and number of optimization steps. The input data dimensionality and batch size are largely limited by the size of the context window in LLMs. In addition, when high-dimensional data are represented in raw text, current LLMs find it hard to grasp the information in the data, which can lead to poor performance in VML. Refer to Appendix B.6 for detailed experiments and discussions. Similarly, since VML parameterizes a model in natural language space, the dimension of the learned model parameter is correlated with the number of tokens used in the text-based parameter. Hence, the complexity of the model parameter is largely limited by the size of the context window in LLMs. Specifically, we empirically observe that the VML model parameters start with a simple model class and gradually shift to more complex model classes during training. Due to the limited budget for querying LLMs, and because the main focus of this work is to showcase the concept of VML and its effectiveness in a diverse set of machine learning tasks, our tasks are all relatively small-scale, i.e., with fewer than 200 training data points for each task, and the data points are usually low-dimensional (our experiments mostly use 2-dimensional data as the raw input to LLMs). Our experiments only have a small number of optimization steps (i.e., fewer than 100), which is sufficient for simple machine learning tasks. However, how to effectively scale VML to more training iterations, higher-dimensional data, and more complex downstream tasks remains an open question.
The connection between LLMs and computers was discussed in our blog. The VML framework is naturally motivated by the idea of LLMs acting as a modern computer, since we view the verbalized model parameters as a way to “program” the LLM. We then connect VML to the von Neumann architecture in the sense that both data and program instruction are in the format of text prompt in VML.
WL was supported by the German Research Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP XX, project number: 276693517. This work was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC number 2064/1 – Project number 390727645. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. RB acknowledges funding by the German Research Foundation (DFG) for project 448588364 of the Emmy Noether Programme. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Tim Z. Xiao. TX acknowledges support from G-Research’s PhD Grant Programme.
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 4
[2] AlphaProof and AlphaGeometry teams. Ai achieves silver-medal standard solving international mathematical olympiad problems. DeepMind blog, 2024. 46
[3] Team AlphaProof and Team AlphaGeometry. Ai achieves silver-medal standard solving international mathematical olympiad problems. DeepMind blog, 2024. 47
[4] Jacob Andreas, Dan Klein, and Sergey Levine. Learning with latent language. In NAACL, 2018. 3
[5] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016. 2
[6] BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. 12
[7] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024. 2
[8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020. 46
[9] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024. 23
[10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 12
[11] Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement learning. In EMNLP, 2022. 3
[12] Samuel J. Gershman and David M. Blei. A tutorial on bayesian nonparametric models. Journal of Mathematical Psychology, 2011. 43
[13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In NeurIPS, 2021. 45, 46
[14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997. 5
[15] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023. 3
[16] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016. 4
[17] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017. 5
[18] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009. 5
[19] Andrei N. Kolmogorov. Three approaches to the quantitative definition of information. International Journal of Computer Mathematics, 1968. 43
[20] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In SIGOPS, 2023. 7
[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 5
[22] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. In NeurIPS, 2023. 3
[23] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022. 3
[24] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 3
[25] Ke Li and Jitendra Malik. Learning to optimize. In ICLR, 2017. 2
[26] Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, et al. Pre-trained language models for interactive decision-making. In NeurIPS, 2022. 2
[27] Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. Guiding large language models via directional stimulus prompting. In NeurIPS, 2023. 3
[28] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In ICRA, 2023. 2
[29] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 4
[30] Shengchao Liu, Jiongxiao Wang, Yijin Yang, Chengpeng Wang, Ling Liu, Hongyu Guo, and Chaowei Xiao. Chatgpt-powered conversational drug editing using retrieval and domain feedback. arXiv preprint arXiv:2305.18090, 2023. 46
[31] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023. 45
[32] Ruotian Ma, Xiaolei Wang, Xin Zhou, Jian Li, Nan Du, Tao Gui, Qi Zhang, and Xuanjing Huang. Are large language models good prompt optimizers? arXiv preprint arXiv:2402.02101, 2024. 3
[33] Eran Malach. Auto-regressive next-token predictors are universal learners. arXiv preprint arXiv:2309.06979, 2023. 46
[34] Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, and Noel E O’Connor. Enhancing clip with gpt-4: Harnessing visual descriptions as prompts. In ICCV, 2023. 3
[35] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. In ICLR, 2023. 3
[36] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024. 12
[37] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In ECCV, 2022. 3
[52] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 47
[53] P. Vitanyi and Ming Li. Minimum description length induction, bayesianism, and kolmogorov complexity. In ISIT, 1998. doi: 10.1109/ISIT.1998.708951. 43
[54] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. 3
[55] Larry Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006. 43
[56] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. 3
[57] Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. In NeurIPS, 2023. 3
[58] Jason Weston and Sainbayar Sukhbaatar. System 2 attention (is something you might need too). arXiv preprint arXiv:2311.11829, 2023. 3
[59] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023. 3
[60] Yaqi Xie, Chen Yu, Tongyao Zhu, Jinbin Bai, Ze Gong, and Harold Soh. Translating natural language to planning goals with large-language models. arXiv preprint arXiv:2302.05128, 2023. 2
[61] An Yan, Yu Wang, Yiwu Zhong, Chengyu Dong, Zexue He, Yujie Lu, William Yang Wang, Jingbo Shang, and Julian McAuley. Learning concise and descriptive attributes for visual recognition. In ICCV, 2023. 3
[62] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. 47
[63] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In ICLR, 2024. 2, 3, 5
[64] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 2023. 11
[65] Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models. In NeurIPS, 2023. 46
[66] Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In CVPR, 2023. 3
[67] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023. 3
[68] Yao Yao, Zuchao Li, and Hai Zhao. Beyond chain-of-thought, effective graph-of-thought reasoning in large language models. arXiv preprint arXiv:2305.16582, 2023. 3
[69] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, and Songfang Huang. How well do large language models perform in arithmetic tasks? arXiv preprint arXiv:2304.02015, 2023. 14
[70] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024. 3, 5, 24, 45
[71] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023. 3
[72] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022. 3
[73] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022. 3, 5, 11, 12, 30
Table of Contents
L.1 Linear Regression
L.2 Polynomial Regression
A.1 Digit Pattern Discovery
Figure 13: Binary digit pattern discovery of 4-integer vectors. No prior is injected to the model parameter in this experiment.
To further demonstrate the interpretability of VML, we create a binary classification task on vectors of 4 digits. Class 0 contains vectors that only have digit ‘0’ in the first position, and Class 1 contains vectors that only have digit ‘0’ in the last position (see Figure 13(b)). Our dataset consists of 100 training data points and 20 test data points (split evenly between the two classes). Models are trained for 5 epochs (i.e., 50 steps with batch size 10). Figure 13(a) shows that both the training and test accuracy improve with the number of steps, hence learning is effective. The model is initialized with the definition of the task. During step 1, the optimizer says it notices that the first element of each input is often ‘0’ when the ground truth label is ‘0’, and decides to use a rule-based approach (see (c)). The resulting model description is half correct: it captures the pattern that ‘if the first element is 0, predicts 0’. After a few more steps, the optimizer is able to learn the correct description: ‘If the first element is 0, predicts 0. Otherwise, if the last element is 0, predict 1.’ Compared to the regression and 2D plane classification results, the learned model here is more interpretable than a learned neural network. Also, without any prior information, one would normally choose a universal approximator such as a neural network to solve this task, which would perform equally well but would certainly not be as interpretable. We also evaluate the performance of in-context learning (ICL) for this task as a baseline. Our result shows that VML is able to achieve 100% test accuracy with an interpretable description of the pattern, while ICL can only achieve 87.5% and does not explicitly output a pattern description.
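To make the final learned description concrete, the following sketch transcribes it into an executable rule and checks it on synthetic data. The data generator is an assumption for illustration (non-zero positions are filled with digits 1 to 9); the actual dataset construction may differ. Returning `None` when neither condition holds is likewise an assumption, since the learned description does not say what to predict in that case.

```python
import random
from typing import Callable, List, Optional, Tuple

Example = Tuple[List[int], int]

def make_dataset(n_per_class: int, rng: random.Random) -> List[Example]:
    """Class 0 has its only '0' in the first position, class 1 in the last.

    Filling the other positions with digits 1-9 is an assumption.
    """
    data: List[Example] = []
    for label in (0, 1):
        zero_pos = 0 if label == 0 else 3
        for _ in range(n_per_class):
            vec = [rng.randint(1, 9) for _ in range(4)]
            vec[zero_pos] = 0
            data.append((vec, label))
    rng.shuffle(data)
    return data

def learned_rule(vec: List[int]) -> Optional[int]:
    """Transcription of the final description learned by the optimizer."""
    if vec[0] == 0:
        return 0
    if vec[-1] == 0:
        return 1
    return None  # case not covered by the learned description (assumption)

def accuracy(rule: Callable[[List[int]], Optional[int]],
             data: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the target."""
    return sum(rule(v) == y for v, y in data) / len(data)

test_set = make_dataset(10, random.Random(0))
print(accuracy(learned_rule, test_set))
```

Because the rule is a two-line decision list, its correctness can be checked by reading it against the class definitions, which is the sense in which the learned description is interpretable.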
A.2 MNIST Image Binary Classification
Figure 14: MNIST binary classification with InternVL2-Llama3-76B [9]. We group digit ‘4’ to class 0, and digit ‘9’ to class 1. No prior is injected to the model parameter in this experiment.
We create a binary classification task from the MNIST dataset. We assign digit ‘4’ to class 0, and digit ‘9’ to class 1. Following the same setup as in Section 4.6, both our training and test sets have 100 MNIST images, half for each class. The model is trained for 5 epochs with batch size 10. We use InternVL2-Llama3-76B [9] as the inference engine, which supports image input. We also tried other LLMs that support image input, such as Claude, GPT-4o and Llama 3.2, but they have been finetuned to reject the digit recognition task, possibly for safety reasons. Therefore, we do not have results for those LLMs. Note that the task can be easily solved if we directly instruct the LLM to classify digit ‘4’ as 0 and ‘9’ as 1. However, this is knowledge obtained from human inspection of the training set. In this toy MNIST example, such classification knowledge is relatively easy to obtain by inspecting the images, but this way of obtaining knowledge is not scalable and requires massive human effort if the image classification problem is complex. In contrast to directly instructing an LLM to perform classification tasks, VML can automatically discover the pattern behind the training data and learn to perform classification without human interference. In this MNIST classification task, we show that VML can easily learn this classification instruction without any human prior knowledge.
Here we provide additional details for the experiments in Section 4.7 and additional ablations.
B.1 Comparison Between In-context Learning and VML
In-context learning (ICL) is a popular method for adapting LLMs to downstream tasks. Here, we compare the performance of VML and ICL in various tasks from previous sections. For all tasks, we provide the entire training set as in-context examples, and query the individual test data independently. The resulting predictions for regression and 2D classification are plotted in Figure 15. The full comparison between VML and ICL is shown in Table 2. We can see that VML outperforms ICL in regression and medical image classification, and has the same performance as ICL in the simpler classification tasks, e.g., two blobs and two circles. Within our framework, ICL can be understood as a nonparametric method, while VML is a parametric one (see Appendix J.3 for more discussion).
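The nonparametric versus parametric distinction can be seen in how the query prompt is assembled. The sketch below is illustrative only: the `Input:`/`Output:` layout and both function names are hypothetical and are not the templates used in our experiments.

```python
from typing import List, Tuple

def icl_prompt(train: List[Tuple[str, str]], query: str) -> str:
    """Nonparametric view: the whole training set accompanies every query."""
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in train)
    return f"{shots}\nInput: {query}\nOutput:"

def vml_prompt(model_description: str, query: str) -> str:
    """Parametric view: only the learned text parameters accompany the query."""
    return f"{model_description}\nInput: {query}\nOutput:"

# Toy regression data (assumed for illustration).
train = [(str(x), str(2 * x + 1)) for x in range(100)]

icl = icl_prompt(train, "7")
vml = vml_prompt("The output is two times the input plus one.", "7")
print(len(icl), len(vml))
```

In this sketch the ICL prompt grows linearly with the number of training examples, while the VML prompt depends only on the length of the learned description.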
Figure 15: Predictions of in-context learning (ICL) for the same regression and classification tasks with Llama-3 70B.
Table 2: Test performance for in-context learning (ICL) and verbalized machine learning (VML) on various tasks from previous section (without adding prior information). The ICL results are chosen from the best across 5 runs. The metrics used for regression (Reg) and classification (Cls) are mean squared error (MSE) and test accuracy, respectively.
B.2 Larger and More Powerful LLMs Learn Faster and Better
To verify whether the performance of VML scales with the capability of LLMs, we compare three Llama-3.1 models of different sizes, i.e., 8B, 70B, and 405B, in the linear regression setting. Figure 16 shows the training loss of 5 individual runs (thin) and their mean (thick) for each LLM. Note that due to the high-variance nature of using LLMs for optimization, we select the 5 best runs out of 10 runs for this comparison. We see that more powerful LLMs (e.g., 405B) learn faster and achieve lower training loss.
B.3 Direct and Indirect Optimization
There are different ways to implement the optimization step in VML. We choose to directly update the model parameters in a single LLM call by providing all the necessary information, i.e., the optimizer call in Algorithm 1. If we choose a lower abstraction level, we can decompose the direct single-step optimization into an indirect multi-step optimization. Algorithm 2 illustrates how this call can be decomposed into four consecutive functions, which resemble the operations of computation graphs in most numerical machine learning frameworks. Specifically, we calculate the following step by step: (1) the quality of the predictions (i.e., evaluate the loss function); (2) the ‘gradient’ of the loss w.r.t. the predictions ŷ; (3) the ‘gradient’ of the loss w.r.t. the model parameters; (4) update the current model parameters using this ‘gradient’. The ‘gradients’ here are known as ‘textual gradients’ in prompt optimization literature [41, 70], which are essentially text-based feedback from LLMs.
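The contrast between one direct call and four chained calls can be sketched as follows. This is a structural illustration under stated assumptions, not Algorithm 1 or Algorithm 2: the prompt wording, the function names, and the `CountingLLM` stub (which stands in for a real LLM endpoint) are all hypothetical.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]
Batch = List[Tuple[float, float]]

def direct_step(llm: LLM, theta: str, batch: Batch, preds: List[float]) -> str:
    """Direct parameterization: one call returns the new model description."""
    return llm(
        f"Current model description:\n{theta}\n"
        f"Batch of (input, target) pairs:\n{batch}\n"
        f"Predictions:\n{preds}\n"
        "Return an improved model description."
    )

def indirect_step(llm: LLM, theta: str, batch: Batch, preds: List[float]) -> str:
    """Indirect parameterization: four chained calls with textual feedback."""
    # (1) evaluate the quality of the predictions
    loss_text = llm(f"Evaluate predictions {preds} against the batch {batch}.")
    # (2) textual 'gradient' of the loss w.r.t. the predictions
    grad_pred = llm(f"Evaluation:\n{loss_text}\nHow should the predictions change?")
    # (3) textual 'gradient' of the loss w.r.t. the model parameters
    grad_theta = llm(
        f"Feedback on predictions:\n{grad_pred}\n"
        f"Model description:\n{theta}\nHow should the description change?"
    )
    # (4) apply the feedback to obtain the updated parameters
    return llm(
        f"Model description:\n{theta}\nFeedback:\n{grad_theta}\n"
        "Return the updated model description."
    )

class CountingLLM:
    """Stub that counts calls and returns a placeholder response."""
    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"<response {self.calls}>"

direct_llm, indirect_llm = CountingLLM(), CountingLLM()
direct_step(direct_llm, "y = x", [(1.0, 3.0)], [1.0])
indirect_step(indirect_llm, "y = x", [(1.0, 3.0)], [1.0])
print(direct_llm.calls, indirect_llm.calls)
```

Each intermediate string in `indirect_step` is the only channel through which information reaches the next call, so anything omitted at one stage is unavailable downstream.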
Figure 16: Llama-3.1 LLMs scale versus VML training performance in linear regression setting. 5 individual runs (thin) and mean (thick) for each LLM.
We compare the two approaches in the linear regression setting using Llama-3.1 70B. Figure 17 shows, for both the direct and indirect optimization, the training loss of 5 individual runs (thin) and their mean (thick). We can see that the indirect method performs slightly worse than the direct method. A possible reason is that the indirect method requires designing three more prompt templates, which is harder than designing just one and carries a higher risk of losing information along the pipeline.
Figure 17: Training loss of direct and indirect optimization in linear regression setting using Llama-3.1 70B. The lines show 5 individual runs (thin) and mean (thick) for each approach.
B.4 Evaluations on a Diverse Set of LLMs
Most of our experiments in Section 4 are done with Llama-3-70B. In this section, we evaluate various LLMs other than Llama-3-70B on four tasks from Section 4, including linear regression (Reg-Linear), polynomial regression (Reg-Poly), two blobs classification (Cls-Two Blobs), and two circles classification (Cls-Two Circles).
Table 3: Test performance for VML using various LLMs on regression and classification tasks from previous section (without adding prior information). The last two tasks are done with prompts in Chinese. For each setting, we include results for 5 runs, and we highlight the runs with worse performance than the ICL English best@5 baselines (see Table 2) in red.
As for LLMs, we use two proprietary LLMs (Claude-3.5-Sonnet and GPT-4o), and two open-source LLMs (DeepSeek-V3 and Qwen2.5-72B-Instruct). The experiments are done in the same setting as in Table 2, i.e., 2 epochs of training for regression and 5 epochs of training for classification.
Results in Table 3 show that all four LLMs are able to perform well in the same settings. In particular, when comparing with the best@5 results using Llama-3-70B in Table 2, all four LLMs here have better performance in Reg-Linear and Reg-Poly, which might be due to the fact that these four LLMs are newer and more capable than Llama-3-70B. As for the two classification tasks, all four LLMs match the performance of Llama-3-70B in Cls-Two Blobs, and three out of the four outperform Llama-3-70B in Cls-Two Circles.
Overall, other than the fact that more powerful LLMs can learn faster and achieve lower loss (a similar finding to Section 4.7 and Appendix B.2), there is not much difference between them, regardless of whether they are proprietary or open-source. Therefore, if cost is not a constraint, one should choose the most powerful LLM available for VML.
B.5 Using Language Other Than English
Our experiments in Section 4 are all done using English. In this section, we provide experiments using Chinese only, i.e., the learner and optimizer templates, and the model initial parameters are all in Chinese. We construct these Chinese prompts by first asking ChatGPT for a translation of the existing English version, then asking a native Chinese speaker to verify the translation. We replace the original English prompt with the Chinese version, and run the same VML algorithm for linear regression (Reg-Linear) and polynomial regression (Reg-Poly) using the same setting as in Table 2. We use four different LLMs for the experiments, two proprietary (Claude-3.5-Sonnet and GPT-4o), and two open-source (DeepSeek-V3 and Qwen2.5-72B-Instruct).
The bottom two sections in Table 3 show the results. We can see that the performance is slightly worse than the English version (the top two sections in the same table) across all LLMs. But the results are still better than the ICL English best@5 baselines in Table 2. This is likely due to most of the LLM developers putting their efforts into the English corpus and English benchmarks, which highlights a weakness of the existing models.
B.6 High-dimensional Two Blobs Classification
The two blobs classification task in Section 4.4 has only two feature dimensions for each data point. In this section, we extend the same task to higher numbers of data dimensions, from 2-D to 10-D. Note that these high-dimensional data are represented in raw text and are processed by the text encoder of an LLM during VML. This is different from the data used in medical image classification (i.e., images in Section 4.6), which are also high-dimensional data, but they are processed by the image encoder of a vision-language model rather than the text encoder.
The experiments are done with the same setting as in Section 4.4 but with different numbers of feature dimensions. Table 4 shows the best test accuracy out of 5 runs, and the average test accuracy over the 5 runs. We can see from the results that with Llama-3-70B, VML struggles to perform well when the data dimensions are larger than 7-D. There are a few possible explanations, including: (1) the current LLMs are not trained to understand the text representation of high-dimensional data; (2) the current LLMs do not handle long context well, so they cannot grasp the information in high-dimensional data.
Table 4: Test performance for high-dimension two blobs classification using Llama-3-70B. The data points live in the corresponding n-D space on a 2-D hyperplane. The table shows the best results out of 5 runs, as well as the average performance over the 5 runs.
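To illustrate how the raw-text representation grows with dimension, the sketch below embeds 2-D blob data into an n-D space and serializes each point as text. The construction is an assumption for illustration: the random linear embedding, the blob centers and spread, and the two-decimal text format are hypothetical and may differ from the data used for Table 4.

```python
import random
from typing import List, Tuple

Point2D = Tuple[float, float]

def two_blobs(n_per_class: int, rng: random.Random) -> List[Tuple[Point2D, int]]:
    """Two Gaussian blobs in 2-D (centers and spread are assumed values)."""
    centers = {0: (-2.0, -2.0), 1: (2.0, 2.0)}
    return [
        ((cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)), label)
        for label, (cx, cy) in centers.items()
        for _ in range(n_per_class)
    ]

def make_basis(n: int, rng: random.Random) -> List[List[float]]:
    """Two random directions in n-D that span a 2-D plane."""
    return [[rng.gauss(0, 1) for _ in range(n)] for _ in range(2)]

def embed(point: Point2D, basis: List[List[float]]) -> List[float]:
    """Map the 2-D point (a, b) to a*u + b*v, which lies on the 2-D plane."""
    a, b = point
    u, v = basis
    return [a * ui + b * vi for ui, vi in zip(u, v)]

def to_text(vec: List[float]) -> str:
    """Serialize one data point as it might appear in a prompt."""
    return "[" + ", ".join(f"{x:.2f}" for x in vec) + "]"

rng = random.Random(0)
blobs = two_blobs(5, rng)
for n in (2, 5, 10):
    basis = make_basis(n, rng)
    texts = [to_text(embed(p, basis)) for p, _ in blobs]
    print(n, sum(len(t) for t in texts) / len(texts))
```

The intrinsic structure stays two-dimensional throughout; only the length of the text the LLM must read per data point increases with n.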
B.6.1 How should VML handle high-dimensional data?
The ability of VML to handle high-dimensional data is mainly constrained by the inference backbone, i.e., LLMs. The experiments in this section together with the experiments in Section 4.6 demonstrate two different approaches to handling high-dimensional data, i.e., either representing the data in raw text and processing them with the text encoder of an LLM, or representing the data as images and processing them with the image encoder of a vision-language model. We see that if there is a corresponding encoder for the high-dimensional data (e.g., images), VML can easily handle tasks that involve these high-dimensional data.
Of course, we can also finetune a text-only LLM to handle high-dimensional data in raw text, which might improve the performance of the corresponding VML task. However, if we think about how humans handle high-dimensional data, we know that humans rarely do inference directly on the raw text representation of high-dimensional data, which is very difficult. For many high-dimensional data such as sound and images, humans have dedicated encoders to process them. For those that do not have dedicated encoders (e.g., radio wave), humans often use tools to preprocess the data into the representation that can be easily encoded, then do inference afterwards.
This points us towards a possible path to improve VML’s ability for handling high-dimensional data. First, we need to develop better inference engines to support more data modalities, and make sure they can reason in those modalities well. Second, we need to improve the inference engines’ ability in tool calling so that they know when to use existing tools (such as Python) to preprocess the data if they find them too difficult to reason about in the current format.
Figure 18: Four examples of failure runs in VML. (a) polynomial regression; (b) medical image classification; (c) linear regression; (d) polynomial regression. For (b) and (c), we only show the learned model, as the training dynamic is less relevant for these two cases.
In this section, we showcase four different examples where VML failed to learn a desirable model. Some of these failures can be avoided by changing the VML prompt template, while others could occur less frequently if we switch to a more powerful LLM.
Trapped in a local minimum. Figure 18 (a) shows a failure run for the polynomial regression task. Specifically, this corresponds to run 3 in Table 3 Reg-Poly (English) with Qwen2.5-72B-Instruct. Our log for this run records that after the first optimization step, the model is updated to a linear regression model, and the remaining optimization steps simply try to fit this linear model to the data. Therefore, the training dynamics plot shows a fluctuating line, and the step with the lowest training loss (i.e., Step 19) still has a linear regression model. Unlike the successful runs, where the optimizer realizes that a quadratic function can be a better model class than a linear function, in this failure case the optimization is clearly trapped in the local minimum of a linear model. One possible way to reduce such failure cases is to use more powerful LLMs. In the same Table 3 Reg-Poly (English), we can see that when we use the more powerful GPT-4o, all five runs have a much lower test loss (i.e., 10) than this failure case (i.e., 400).
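To make the gap between the two model classes concrete, here is a minimal sketch that fits the best possible line to noise-free quadratic data and compares its error with that of the quadratic itself. The target function and input range are assumptions chosen for illustration, not the exact experimental setup.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def truth(x):
    # Assumed quadratic ground truth, for illustration only.
    return 3 * x * x + x + 2


xs = [-3 + 0.1 * i for i in range(41)]  # assumed input range [-3, 1]
ys = [truth(x) for x in xs]

slope, intercept = fit_line(xs, ys)
linear_mse = mse([slope * x + intercept for x in xs], ys)
quadratic_mse = mse([truth(x) for x in xs], ys)
```

Even the optimal line leaves a large residual on such data, so an optimizer that only tunes the slope and intercept cannot reach the loss of the quadratic model class, no matter how many steps it takes.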
Not describing the data pattern. Figure 18 (b) shows a failure case for the medical image classification task in Section 4.6. In this run, instead of a semantic pattern description for the two classes of images (e.g., Figure 8), the optimizer returns a description of a two-layer neural network (without exact values for the weights) and the training procedure. Using this description to do inference directly on the input image will undoubtedly lead to a useless answer. The model description after the next update is not shown in Figure 18, but our log shows that, due to the expected poor performance of the current description, the optimizer proposes to increase the number of layers in the neural network to make it more capable. One way to avoid such a failure is to add instructions to the prompt template of the optimizer specifying, for example, “the new model description should be a decision rule which must base on the features in the input image”.
Missing crucial information when describing a parametric model. Figure 18 (c) shows a failure case for the linear regression task in Section 4.1. There are two issues with this learned model. One is that the function proposed by the optimizer is too complex for a linear regression task, which indicates overfitting, i.e., trying to fit all the data perfectly. The other, more significant issue, which directly prevents the model from being evaluated on any given data point, is that the function in the description contains parameters with unknown values. Such a function is not fully defined, and therefore the inference error will be large. Similarly, we can avoid such a failure by adding instructions to the prompt template of the optimizer specifying, e.g., “must provide the exact value of the parameters if the description potentially involve unknown or learnable parameters”.
Aggressive fitting. Figure 18 (d) shows a failure run for the polynomial regression task. Specifically, this run corresponds to run 2 in Table 3 Reg-Poly (English) with DeepSeek-V3. We can see from the figure that after the first step of optimization, the learned model is already a quadratic function, and it is quite close to the ground truth. However, as the training progresses, the learned model deviates from the ground truth model class and becomes a more complex function, which does have a low training loss but cannot extrapolate outside of the training data distribution (i.e., for 1). This is similar to cases in classical machine learning where the learning rate is too high, which causes the optimizer to escape from a good local minimum and end up in a worse solution. This failure happens less frequently with more powerful LLMs, as we can see from Table 3 Reg-Poly (English), where GPT-4o has no failure runs out of 5.
D.1 APE Experiments Details
For our APE experiments in Section 4.8, we use the code from the authors’ GitHub repo. Unlike in the APE paper [73] which uses GPT-3 as the LLM, here we use Llama-3-70B. Note that our VML experiments in Section 4.8 are also done with Llama-3-70B for a fair comparison.
Figure 19: Prompt templates used for our APE experiments.
The workflow of APE mainly has two steps that rely on LLM calls. The first step is to use the provided data in batches to construct proposal queries to sample a set of possible candidate prompts (see Figure 19(left) for the template we used). The second step is to evaluate each candidate prompt with the data (see Figure 19(right) for the template we used), and choose the best candidate based on some metric. We use a general metric for our experiments, which is the likelihood of a candidate prompt.
There are a few hyperparameters for the APE algorithm. We tried different batch sizes for the proposal queries, and settled on a batch size of 5. Another important hyperparameter is max_tokens, the maximum number of tokens allowed in the completion. We tried both 50 and 500. The prompts we show in Section 4.8 Figure 10 are the results of setting max_tokens to 50. This gives us the most concise and reasonable prompts, but due to the hard cutoff at length 50, the prompt can be incomplete. If we allow a longer response by setting max_tokens to 500, it is still possible to have incomplete candidate prompts. At the same time, these longer prompts are often worse, as we know the ground truth prompt (or pattern description) is around one or two sentences. See Figure 20 for the result of the same text classification task in Section 4.8 but with max_tokens set to 500.
D.2 Differences between APE and VML
Even though APE and VML both try to optimize a prompt towards a certain target, there are fundamental differences between the two. We provide pseudocode for each of them here to compare their differences. When applying APE’s algorithm (see Algorithm 3) to a learning problem, it can be summarized in two steps. First, we generate a set of candidate prompts from the training data by letting LLMs complete the instruction needed to produce the x, y pairs. Then, we use a score function to rank the candidate prompts, and choose the best one. In the case of VML (see Algorithm 4), we start with an initial prompt, and we infer the corresponding ˆy for each x in the current batch of training data. Then, we ask an LLM to generate a better prompt that can explain the current batch of x, y pairs, taking into account the ˆy produced by the current prompt, and we iterate the same process on the next batch of training data until convergence. We highlight two distinctions between APE and VML below.
‘Gradient’-free vs. ‘Gradient’-based. In APE, the candidate prompts are sampled directly given only the training data (see Figure 19 (left) for the template), then the one with the highest score is selected within the set. In VML, the prompts are generated by asking an LLM to explicitly reflect on the last prompt and the corresponding prediction ˆy, then propose a new prompt that can better predict the target y for the given x. This process requires the optimizer LLM to explicitly reason about the following: why does the last prompt produce the current ˆy; how to modify the last prompt to minimize the prediction error; what is a better description for the relation between x and y. If we use the language of classical machine learning, we can say that the optimization in VML makes use of the ‘gradient’ information from the last prompt and the current batch of training data, while the optimization in APE is ‘gradient-free’.
Figure 20: Resulting prompt from APE for the text classification task with max_tokens being 500.
Numerical score function vs. Self-evaluation. Another important distinction is that APE requires a predefined score function that can give a numeric score to each candidate prompt. For example, one can use the log-likelihood of a prompt as the score, which normally requires access to the weights of an LLM, and hence is only possible for open-source models. In contrast, VML does not require such a score function to evaluate the prompt. Evaluation of a prompt in VML is done by the LLM itself purely in natural language. This is more flexible and agnostic to different LLMs (e.g., proprietary or open-source).
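The structural difference between the two procedures described in this section can be sketched as follows. This is a simplified illustration, not a reproduction of Algorithm 3 or Algorithm 4: the LLM calls are passed in as plain callables, and the toy stand-ins at the bottom (including the squared-error score used in place of a likelihood) are hypothetical and exist only so that the sketch runs.

```python
def ape_select(data, propose, score, num_candidates):
    """APE-style search: sample candidates from the data, score them, keep the best."""
    candidates = [propose(data) for _ in range(num_candidates)]
    return max(candidates, key=lambda prompt: score(prompt, data))


def vml_train(batches, init_prompt, infer, optimize):
    """VML-style training: revise a single prompt batch by batch."""
    prompt = init_prompt
    for batch in batches:
        predictions = [infer(prompt, x) for x, _ in batch]
        # The optimizer sees the current prompt, the batch, and the predictions.
        prompt = optimize(prompt, batch, predictions)
    return prompt


# --- Toy stand-ins for LLM calls (hypothetical, for illustration only) ---
def toy_infer(prompt, x):
    """Read the offset k from a prompt of the form 'y = x + k' and apply it."""
    return x + float(prompt.split("+")[1])


def toy_score(prompt, data):
    """Numeric score required by APE; here negative squared error."""
    return -sum((toy_infer(prompt, x) - y) ** 2 for x, y in data)


def toy_optimize(prompt, batch, predictions):
    """Shift the offset by the mean residual, mimicking a 'gradient'-like update."""
    offset = float(prompt.split("+")[1])
    residual = sum(y - p for (_, y), p in zip(batch, predictions)) / len(batch)
    return f"y = x + {offset + residual}"


data = [(0.0, 3.0), (1.0, 4.0), (2.0, 5.0), (3.0, 6.0)]
pool = iter(["y = x + 1", "y = x + 3", "y = x + 5"])  # pretend LLM samples
ape_prompt = ape_select(data, lambda _data: next(pool), toy_score, num_candidates=3)
vml_prompt = vml_train([data[:2], data[2:]], "y = x + 0", toy_infer, toy_optimize)
```

Note that `ape_select` never looks at a previous prompt or its predictions, whereas `vml_train` passes both to the optimizer at every step, and only `ape_select` needs a numeric `score`.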
Figure 21: Training dynamics for two different optimization settings in the polynomial regression setting. One has access to the accurate loss computation, and the other does not.
The VML algorithm in Algorithm 1 specifies that the arguments of the optimizer function consist of the inputs x, the predictions ˆy, the targets y, the current model parameters, and the optimizer configurations. Hence, there is no explicit definition of the loss function for the optimizer (see Figure 2(right) for an example of the verbalized loss function). It is up to the optimizer itself to evaluate the difference between the prediction ˆy and the target y. We are interested in whether having access to the real training loss (defined and computed for logging purposes), the mean squared error in this case, can help the optimizer to better navigate the training trajectory.
The orange line in Figure 21(c) shows that having such accurate loss feedback might not help, and might even decrease the performance in this scenario. One possible explanation is that the single loss value itself does not contain much information. Moreover, as the exact form of the loss function can be fed to the LLM easily, the LLM might spend additional effort estimating the exact form of the loss function, which makes convergence even more difficult. It actually makes intuitive sense that a verbalized loss function (i.e., using natural language to explain the target of the loss function) works better in the VML framework. For example, knowing how each prediction contributes to the loss value can be more informative than a single overall loss value, since the model might be doing well on some data points but not on others, and we only want to improve the model for the points with bad predictions.
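The difference in information content can be seen in a small sketch. The function names, the tolerance, and the feedback wording below are hypothetical; they only illustrate how per-point feedback localizes an error that a single mean squared error value hides.

```python
def mse(preds, targets):
    """Single scalar summary of the whole batch."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def per_point_feedback(inputs, preds, targets, tolerance):
    """One line of feedback per data point, flagging only the bad predictions."""
    lines = []
    for x, p, t in zip(inputs, preds, targets):
        err = p - t
        status = "ok" if abs(err) <= tolerance else "needs improvement"
        lines.append(f"x={x}: predicted {p}, target {t}, error {err:+.2f} ({status})")
    return lines


inputs = [0, 1, 2, 3]
targets = [2, 3, 4, 5]
preds = [2, 3, 4, 9]  # only the last prediction is off

overall = mse(preds, targets)
feedback = per_point_feedback(inputs, preds, targets, tolerance=0.5)
```

Here the overall loss only says that something is wrong, while the per-point lines show that three predictions are fine and only the one at x=3 needs to change.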
Figure 22: Function evaluations and numerical error in Llama-3 70B.
LLMs are designed to do language modeling, rather than exact calculations. Hence, their performance on evaluating functions can be unreliable, and might result in errors. Figure 22 shows that Llama-3 evaluates the given linear and polynomial functions well, as the mean is quite accurate. The variance over 10 runs is also fairly small, except for one or two points. However, for a more complex function such as sin(x), Llama-3 is only able to return small error approximately in the range of 2). Both the error and the variance are large outside of this range. This explains the non-smoothness of the function in Figure 5(b; right), which has sin(x + 1.0) in the learned model parameters.
By switching to the more powerful model, GPT-4o, we can see from Figure 23 that both the error and the variance decrease. In particular, for sin(x), GPT-4o returns smaller error in a larger range, (i.e., 0)). This implies that as the capability of LLMs improves, their performance in evaluating more complex functions also improves.
Nevertheless, this is currently still a limitation for VML if the optimizer chooses to use complex mathematical functions as the model parameters. If the evaluation of the function has an error, then during training, the optimizer will update the model parameters based on a noisy signal. This can lead to large variance in training and slow convergence. Future work should look into methods for minimizing the numerical error in LLM function evaluation.
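The measurement behind this kind of analysis (mean and variance over repeated evaluations, compared with the exact value) can be sketched as below. The `noisy_sin` evaluator is a hypothetical stand-in for an LLM call, with a made-up noise model whose spread grows with |x|; it is not a model of how any particular LLM behaves.

```python
import math
import random
import statistics


def summarize(evaluator, truth, xs, runs=10):
    """Evaluate each x several times and report the mean, error, and variance."""
    rows = []
    for x in xs:
        outputs = [evaluator(x) for _ in range(runs)]
        mean = statistics.mean(outputs)
        rows.append({
            "x": x,
            "mean": mean,
            "abs_error": abs(mean - truth(x)),
            "variance": statistics.variance(outputs),
        })
    return rows


rng = random.Random(0)


def noisy_sin(x):
    # Hypothetical stand-in for an LLM evaluating sin(x); the noise is an assumption.
    return math.sin(x) + rng.gauss(0.0, 0.05 * abs(x))


report = summarize(noisy_sin, math.sin, [0.0, 1.0, 4.0])
```

Replacing `noisy_sin` with a function that queries a real model would give the same kind of per-point summary.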
After applying the sine function to the input, you always add 2 to the resulting value. This shifts the sine wave vertically by 2 units.
Figure 24: Function evaluations based on the natural language description of the corresponding symbolic sine function.
Figure 24 shows that if we use natural language to describe the symbolic sine function (see sub-figure (a)), GPT-4o is able to produce more accurate evaluations than when using the symbolic function (see (c)). The accuracy of Llama-3 70B also increases, even though it still underperforms GPT-4o (see (b)). This is likely because Llama-3 is less capable of instruction following than GPT-4o. This observation implies that in VML, we might want to instruct the optimizer to avoid using complex symbolic functions in the update and to prefer the natural language description of the function.
In this section, we supplement the experiments on Llama-3 70B with a Python interpreter. Although LLMs are able to perform numerical data tasks, incorporating a Python interpreter further improves their ability to deal with numerical values. Specifically, we use the open-interpreter library to add a Python interpreter to Llama-3 70B, such that the LLM can use Python programs to evaluate symbolic functions or perform numerical operations. We follow the same experimental settings as in Section 4.3 (sinusoidal regression of y = sin(x)+2). The training data is only sampled from [−3, 3] with additive Gaussian noise. The in-domain testing data is sampled from the same range, while the out-of-domain testing data is sampled from [−6, −3] and [3, 6].
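A minimal sketch of this data setup, and of evaluating a symbolic model exactly in Python, is given below. It assumes the ranges [-3, 3] for training and in-domain testing and [-6, -3] and [3, 6] for out-of-domain testing; the sample sizes, the noise standard deviation, and the choice of noise-free test targets are assumptions made for illustration.

```python
import math
import random


def target(x):
    """Ground-truth function of the sinusoidal regression task."""
    return math.sin(x) + 2.0


def sample(rng, low, high, n, noise_std):
    """Draw n points uniformly from [low, high] with Gaussian noise on y."""
    data = []
    for _ in range(n):
        x = rng.uniform(low, high)
        data.append((x, target(x) + rng.gauss(0.0, noise_std)))
    return data


def mse(model, data):
    """Mean squared error of a Python callable on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
train = sample(rng, -3.0, 3.0, 100, noise_std=0.1)        # assumed size and noise
in_domain_test = sample(rng, -3.0, 3.0, 20, noise_std=0.0)
out_of_domain_test = (
    sample(rng, -6.0, -3.0, 10, noise_std=0.0)
    + sample(rng, 3.0, 6.0, 10, noise_std=0.0)
)

# A learned description such as "y = sin(x) + 2" can be evaluated exactly
# once it is handed to an interpreter instead of being computed in-context.
ood_error = mse(target, out_of_domain_test)
```

When the symbolic form is executed as code, the evaluation error outside the training range comes only from the model description itself, not from the arithmetic.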
Table 5: Evaluation (using mean squared error) on sinusoidal regression as in Figure 5(b) for three different models: (1) neural networks, (2) Llama-3 with prior, and (3) Llama-3 with prior and code interpreter.
From the table, we can observe that with the Python interpreter, Llama-3 70B can effectively learn periodic functions, while in the original experiment (i.e., Figure 5(b)), the same LLM is unable to approximate periodic functions even with a prior. The results show that the tool-using ability can further improve the learnability of VML. Example logs for inference with the learned model are shown below.
H.1 From Vague to Concrete Model Parameters
The model parameters generated by a VML optimizer can be vague or concrete. For those with vague descriptions, we are curious what the LLM evaluations look like, and whether they have large variance. Figure 25 shows the results on Llama-3 70B for six different model descriptions.
(a) shows that if we only provide the information that the task is a regression task and do not specify the model at all, the LLM tends to predict a linear function (slope 1) with increasing variance as x moves away from 0. (b) shows that if we specify that there is a linear relationship between inputs and outputs, the LLM will predict a linear function with a similar slope to (a) but with smaller variance. (c) shows that if we specify the explicit form of the linear function, the slope will still be around 1, but the variance is larger when x > 1. (d, e, f) show that by providing a range for the values of the unknown variables, the LLM tends to use the mid-point of the range for the values, and a smaller range does correspond to a smaller variance in prediction.
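One way to build intuition for these observations is a toy model in which an unspecified slope is treated as if it were drawn uniformly from the stated range. This is an assumption made purely for illustration and is not a claim about how the LLM actually behaves. Under this toy model, the prediction for y = a·x has mean equal to the mid-point slope times x, and variance x²(high − low)²/12.

```python
def prediction_stats(x, low, high):
    """Mean and variance of a * x when a is uniform on [low, high] (toy model)."""
    midpoint = (low + high) / 2
    variance = (x ** 2) * (high - low) ** 2 / 12
    return midpoint * x, variance


wide_mean, wide_var = prediction_stats(2.0, 0.0, 2.0)       # slope in [0, 2]
narrow_mean, narrow_var = prediction_stats(2.0, 0.5, 1.5)   # slope in [0.5, 1.5]
_, far_var = prediction_stats(4.0, 0.0, 2.0)                # same range, larger |x|
```

Both ranges give the same mean prediction (the mid-point slope), the narrower range gives a smaller variance, and the variance grows as x moves away from 0, which is qualitatively in line with the trends described above.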
Moreover, LLMs have shown remarkable potential in numerical data tasks for machine learning, and our work is one of the first methods to reveal such a potential. Some concurrent works [46, 70] also gave empirical evidence that LLMs can be fundamentally suitable for machine learning tasks.
Verbalized machine learning aims to provide a framework for LLMs to deal with machine learning tasks, with the ability to fully interpret the learned knowledge in natural language. We believe this framework will become increasingly powerful as LLMs get more powerful. We have already observed a performance improvement in VML when switching from Llama-3 to GPT-4o.
Hallucination in VML is when the optimizer does not produce a sensible model parameter for the current step of training. From our experiment section, we can indeed see that many of our case studies have a non-monotonic training loss, and some of the fluctuation can be explained by hallucination (see failure cases in Appendix C for example). However, since hallucination is a low probability event, an unsatisfied model
L.1 Linear Regression
Figure 27: Prompt templates of VML for the learner and optimizer for the linear regression (Llama-3-70B without prior).
L.2 Polynomial Regression
Figure 28: Prompt templates of VML for the learner and optimizer for the polynomial regression (Llama-3-70B without prior).
L.3 Sinusoidal Regression
Figure 29: Prompt templates of VML for the learner and optimizer for the sinusoidal regression (GPT-4o with prior).
L.4 Two Blobs Classification
Figure 30: Prompt templates of VML for the learner and optimizer for the two blobs classification (Llama-3-70B without prior).
L.5 Two Circles Classification
Figure 31: Prompt templates of VML for the learner and optimizer for the two circles classification (Llama-3-70B with prior).
L.6 Text classification
If the model is doing well, you can keep using the current descriptions. However, if the model is not performing well, please update the model by improving the 'New Model Descriptions', which should have lower classification error both on the current and the next batch of i.i.d. data. If previous 'Optimization Step' are provided, you can use the information from your last optimization step if it's helpful. Please think step by step and give your outputs strictly in the following format:
Figure 32: Prompt templates of VML for the learner and optimizer for the text classification (Llama-3-70B without prior).
M.1 Linear Regression (Llama-3-70B without prior)
M.2 Polynomial Regression (Llama-3-70B without prior)
M.3 Sinusoidal Regression (GPT-4o with prior)
M.4 Two Blobs Regression (Llama-3-70B without prior)
M.5 Two Circles Regression (Llama-3-70B without prior)
M.6 Two Circles Regression (Llama-3-70B with prior)
M.7 Text Classification (Llama-3-70B without prior)
M.8 Medical Image Classification (GPT-4o with prior)
M.9 Medical Image Classification (GPT-4o without prior)