knowledge, and it can update the model class during training; and (3) interpretable learner updates: the LLM-parameterized optimizer can provide explanations for why an update is performed. We empirically verify the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability.
- Ludwig Wittgenstein
1 INTRODUCTION
The unprecedented success of large language models (LLMs) has changed the way people solve new problems in machine learning. Compared to conventional end-to-end training, where a neural network is trained from scratch on some curated dataset, it has become increasingly popular to leverage a pretrained LLM and design good prompts that contain in-context examples and effective instructions. These two ways of problem-solving lead to an intriguing comparison. Traditionally, we would optimize a neural network in a continuous numerical space using gradient descent, while in the new approach, we optimize the input prompt of an LLM in a discrete natural language space. Since a neural network is effectively a function parameterized by its numerical weight parameters, can a pretrained LLM act as a function that is parameterized by its natural language prompt?
Driven by this question, we conceptualize the framework of verbalized machine learning (VML),
VML is that we can define a machine learning model using natural language, and the training of such a model is based on the iterative update of natural language. This framework enables many new possibilities for interpretability, as the decision rules and patterns learned from data are stored and summarized in natural language. Specifically, we propose to view the input text prompt of LLMs as the model parameters that are being learned. However, optimization over such a natural language parameter space also introduces additional difficulties. Inspired by previous work [3, 22] where the optimizer is viewed as a function parameterized by a neural network, we parameterize the optimizer function as another LLM, which produces the next-step model parameters by taking in the current
optimizer LLM to update the learner LLM iteratively such that the training objective can be reached.
Compared to conventional numerical machine learning, the VML framework brings a few unique advantages. First, VML introduces an easy and unified way to encode inductive bias into the model. Because the model parameters are fully characterized by human-interpretable natural language, one can easily express the inductive bias in language. This linguistic parameterization makes machine learning models fully interpretable and adjustable. For example, if the input and output data are observed to be linearly correlated, then one can use this sentence as part of the text prompt. How to effectively encode inductive bias is a longstanding problem in machine learning, and VML provides a unified way to inject the inductive bias through natural language, just like teaching a human learner. Second, VML performs automatic model selection during the learning process. The optimizer LLM can automatically select a suitable model class based on the training data and verbalized prior knowledge. Third, each update of the model is fully interpretable in the sense that the optimizer LLM can give an explanation of why it chooses such an update. One can even interact with the optimizer LLM in order to inject new prior knowledge or obtain detailed reasoning.
VML can be viewed as a natural generalization of in-context learning (ICL). Specifically, ICL is a single-step implicit learning process, while VML is a multi-step iterative learning process where the in-context examples are summarized into verbal patterns and knowledge. Moreover, VML provides a way of scaling inference-time compute [5, 44]. Compared to best-of-N re-sampling, VML iteratively updates its model parameter prompt by taking into account the learner’s past predictions.
numerical machine learning, language models in VML do not differentiate data and model, and treat both of them as part of the text prompt. This shares a striking connection to stored-program
as data rather than wiring setups. The link between language models and stored-program computers underscores the importance of text prompts, which play a similar role to computer programs, and,
• We formulate the framework of verbalized machine learning, where pretrained language models are viewed as function approximators parameterized by their text prompts. Then, we revisit some classical machine learning problems and show that VML is able to solve them.
• We design a concrete VML algorithm with a text prompt template. This algorithm parameterizes both the learner model and the optimizer as LLMs, and enables the iterative verbalized training.
• We conduct an empirical study for the injection of verbalized inductive bias and show that it is promising to use natural language as a unified way to encode prior knowledge.
• We validate the effectiveness of VML in different applications (Section 4, Appendix B,C,D,E).
agents [45, 55, 23, 25], such that they can follow natural language instruction to complete complex tasks. More recently, LLMs have been used to solve optimization problems [57]. Specifically, the LLM generates a new solution to an optimization problem from a prompt that contains previously generated solutions and their loss values. The LLM optimizer in [57] shares a high-level similarity to our work, as we also aim to solve an optimization problem with LLMs. The key difference to [57] is our function approximation view of LLMs, which enables us to revisit classical machine learning problems and solve them through natural language in the VML framework.
Natural language to facilitate learning. [41, 20, 21, 33, 65] show that natural language captions serve as an effective supervision to learn transferable visual representation. [31, 34, 36, 30, 60, 56] find that natural language descriptions can easily be turned into zero-shot classification criteria for images. [2] proposes to use natural language as latent parameters to characterize different tasks in few-shot learning. In contrast to prior work, VML directly uses the text prompt of LLMs to parameterize functions and learns the language-based model parameters in a data-driven fashion.
Prompt engineering and optimization. There are many prompting methods [51, 66, 67, 49, 61, 62, 53] designed to elicit the reasoning ability of LLMs. To reduce the hand-crafting efforts in designing good prompts, automatic prompt optimization [66, 67, 57, 37, 52, 8, 24, 28, 46] has been proposed. Unlike prompt optimization where the text prompt is optimized without changing its semantic meaning, VML updates its language-based model parameters by adding or modifying the model prior information, making the learner model fully interpretable about its prediction.
LLMs for multi-agent systems. Due to their strong instruction-following ability, LLMs are capable of playing different roles in a multi-agent system. [38, 54, 12, 19] study a multi-agent collaboration
system where one LLM plays the role of learner and the other LLM plays the role of optimizer.
Figure 1: A comparison between numerical machine learning and VML.
Classical machine learning models (e.g., neural networks) are typically trained in a numerical and continuous parameter space. Once trained, these models are stored as a collection of numbers that are not interpretable and remain a black box. Motivated by the strong universal problem-solving capability of LLMs, we find it appealing to view an LLM as a function approximator parameterized by its text prompt. This perspective leads to the VML framework. Similar to a general-purpose modern computer whose functionality is defined by
its running program, a function that is defined by an LLM is characterized by its text prompt. Because the text prompt is fully human-interpretable, the VML framework provides strong interpretability and makes it easy to trace the cause of model failure. Figure 1 gives a comparison between numerical
token-based format, while numerical machine learning treats data and model parameters differently.
VML parameterizes a machine-learning model with natural language. More formally, VML places a strong constraint on the model parameters $\theta = \{\theta_1, \theta_2, \ldots, \theta_t, \ldots\} \in \Theta_{\text{language}}$ in exchange for interpretability, where $\theta$ is a text token sequence, each $\theta_t$ is some text token from a large token set $\mathcal{A}$, and $\Theta_{\text{language}}$ denotes the set of all natural language sequences that humans can understand. The model parameter space in VML has the following properties: (1) discrete: the natural language space is discrete; (2) sequential: the natural language space is sequential, and the next word is dependent on its previous words. In contrast, the parameter space in numerical machine learning is not sequentially dependent; and (3) human-interpretable: the natural language that characterizes the model is human-interpretable. More discussion is given in Appendix G.
One of the most significant advantages of using natural language as the model parameters is the easy
one can observe and understand what gets added and what is modified. Our empirical evidence also supports our interpretability claim, as we find that the model parameters are typically a language description of the underlying pattern that the model discovers from the training data.
by its natural language prompt. Specifically, we denote the language model as $f(x; \theta)$, where $x$ is the input data and $\theta$ is the function parameter. Both $x$ and $\theta$ are represented with text tokens. In VML, $f(\cdot)$ is typically a frozen language model that is pretrained on a large corpus of text (e.g.,
LLM as zero, which theoretically makes the output deterministic. If we set the temperature high (see Appendix F for more discussion), $f(x; \theta)$ can be viewed as sampling a value from some distribution.
We revisit how a classical machine learning problem is formulated in the VML framework. Suppose we have N data points $\{(x_n, y_n)\}_{n=1}^{N}$ in total, where $x_n$ is the data vector and $y_n$ is the target value. As an example, we consider a least squares regression problem using the LLM-parameterized function:
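A plausible written-out form of this least-squares objective is sketched below. The notation is an assumption made for illustration: $f_{\text{model}}(x; \theta)$ stands for the learner LLM evaluated on input $x$ with text prompt $\theta$, $\{(x_n, y_n)\}_{n=1}^{N}$ are the data points, $\Theta_{\text{language}}$ is the set of human-understandable text sequences, and the $1/(2N)$ normalization is a conventional choice rather than something fixed by the surrounding text.

```latex
\min_{\theta}\; \ell_{\text{regression}}(\theta)
  := \frac{1}{2N} \sum_{n=1}^{N} \bigl( y_n - f_{\text{model}}(x_n; \theta) \bigr)^2
\quad \text{s.t.} \quad \theta \in \Theta_{\text{language}}
```

The only structural difference from ordinary least squares is the constraint: the parameters must remain a natural language sequence, which is what rules out a plain gradient step.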
Gumbel-softmax [13]) is typically known to be sample-inefficient and sub-optimal.
Because the model parameters $\theta$ in VML form a text prompt, optimizing $\theta$ is effectively a prompt optimization problem. Different from the prompt optimization problem [67], where the goal is to
Figure 2: An overview of iterative optimization and text prompt templates of the learner and the optimizer in the regression example.
produce a generic prompt without adding new information, the training in VML focuses on updating
the modification of existing information. To optimize our model parameters, we start by looking at the gradient of the regression objective function in Equation 1:
where is the learning rate, and the constraint is to
human-interpretable natural language space. It seems infeasible to compute this gradient. To address this, we view the gradient as a function of the data (x, y) and the current model parameters $\theta$. Then we directly approximate the next-step model parameters using another pretrained language model denoted by
use language to specify the update speed, the momentum, etc. The largest possible batch size of the optimizer LLM is determined by its context window. The optimizer LLM can already output natural language that satisfies the constraint, so we simply ask the LLM to play the optimizer role, which has been shown to be quite effective in [57]. More importantly, the performance of our VML framework gets better as the instruction-following ability of LLMs gets stronger. An overview of the iterative
given in Figure 2. The detailed algorithmic training procedure is given in Algorithm 1.
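The iterative procedure can be illustrated with a minimal sketch. This is not the paper's Algorithm 1: the `LLM` callable interface, the prompt wording, and the function names `learner_predict`, `optimizer_step` and `train_vml` are all assumptions introduced here for illustration. The learner LLM predicts on a batch using the current verbalized parameters, and the optimizer LLM receives the batch, the predictions and the current parameters, and returns the next-step parameters as text.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: any callable mapping a prompt string to a completion.
LLM = Callable[[str], str]
Example = Tuple[float, float]  # (input x, target y)


def learner_predict(llm: LLM, theta: str, x: float) -> str:
    """Query the learner: theta is the verbalized model, x is the input."""
    return llm(f"{theta}\nInput: {x}\nOutput:")


def optimizer_step(llm: LLM, theta: str, batch: Sequence[Example],
                   predictions: Sequence[str]) -> str:
    """Ask the optimizer LLM for the next-step model description."""
    rows = [f"input={x}, target={y}, prediction={p}"
            for (x, y), p in zip(batch, predictions)]
    prompt = (
        "You are the optimizer for a model described in natural language.\n"
        f"Current model description:\n{theta}\n"
        "Results on the current batch:\n" + "\n".join(rows) + "\n"
        "Reply with an improved model description."
    )
    return llm(prompt)


def train_vml(learner: LLM, optimizer: LLM, theta0: str,
              data: Sequence[Example], batch_size: int,
              epochs: int) -> Tuple[str, List[str]]:
    """Run the loop and keep every intermediate theta for inspection."""
    theta, history = theta0, [theta0]
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            preds = [learner_predict(learner, theta, x) for x, _ in batch]
            theta = optimizer_step(optimizer, theta, batch, preds)
            history.append(theta)
    return theta, history
```

Keeping the full `history` of parameters is a deliberate choice in this sketch: since every intermediate model is plain text, the whole training trajectory can be read and compared step by step.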
Using an LLM as the optimizer offers several unique advantages. First, the optimizer can perform automatic model selection. When the learner model cannot make correct predictions for the training data, the optimizer will automatically update the learner to a more complex and capable model (see the polynomial regression experiments in Section 4.2 as an example). Second, the optimizer can provide detailed explanations of why a particular update should be performed, which helps us to
also allows us to inject prior knowledge to improve optimization (even during training).
Different optimizer parameterizations. Here we use a direct parameterization, i.e., parameterizing the optimizer as a single function, which couples the gradient and the update functions together. Alternatively, we can use an indirect parameterization where the gradient and the update are two separate LLM-parameterized functions. The gradients are known as “textual gradients” in prompt optimization [37, 64]. In the indirect case, the update of the learner’s model parameters is obtained in two stages: a gradient function first produces the textual gradient, and an update function then produces the new model parameters from it. Both functions are parameterized by LLMs. Compared to direct parameterization, which takes one LLM call, this process takes several LLM calls. We compare both methods in Section 4.8 and Appendix A.3.
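The difference between the two parameterizations can be sketched as follows. The code is illustrative only: the `LLM` callable interface, the prompt wording, and the names `direct_update` and `indirect_update` are assumptions, and the indirect variant is shown with exactly two calls as the simplest instance of "several".

```python
from typing import Callable

# Hypothetical interface: prompt string in, completion string out.
LLM = Callable[[str], str]


def direct_update(optimizer: LLM, theta: str, feedback: str) -> str:
    """Direct parameterization: a single call couples gradient and update."""
    return optimizer(
        f"Model description:\n{theta}\nBatch feedback:\n{feedback}\n"
        "Reply with the updated model description."
    )


def indirect_update(grad_llm: LLM, update_llm: LLM,
                    theta: str, feedback: str) -> str:
    """Indirect parameterization: textual gradient first, then the update."""
    textual_gradient = grad_llm(
        f"Model description:\n{theta}\nBatch feedback:\n{feedback}\n"
        "Describe what should change and why."
    )
    return update_llm(
        f"Model description:\n{theta}\nSuggested changes:\n{textual_gradient}\n"
        "Reply with the updated model description."
    )
```

The indirect form exposes the textual gradient as an inspectable intermediate, at the price of at least one extra LLM call per step.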
VML as a unified framework to encode inductive bias. A unified framework to encode arbitrary inductive bias has been pursued for decades. For different types of data, we need to design different models to encode the inductive bias (e.g., graphical models [15] for random variables, recurrent networks [11] for sequences, graph networks [14] for graphs, and convolution networks [18] for images). VML uses a unified natural language portal to take in inductive biases, making it very flexible for encoding complex inductive bias. To incorporate an inductive bias about the hypothesis class or prior knowledge about the problem, we can simply concatenate a system prompt (i.e.,
the final model parameters are $\{\theta_{\text{prior}}, \theta\}$, where only $\theta$ is learnable and $\theta_{\text{prior}}$ is provided by users.
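One way to picture this split between a fixed, user-provided prior and a learnable part is the small container below. It is a sketch under assumptions: the class name `VerbalizedParams`, its field names, and the newline-joined prompt layout are hypothetical and not taken from the paper's templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbalizedParams:
    prior: str      # user-provided inductive bias; never modified by training
    learnable: str  # the part that the optimizer LLM rewrites

    def as_prompt(self) -> str:
        """Concatenate prior and learnable text into the learner's prompt."""
        parts = [p for p in (self.prior, self.learnable) if p]
        return "\n".join(parts)

    def updated(self, new_learnable: str) -> "VerbalizedParams":
        """Return new parameters with the prior carried over unchanged."""
        return VerbalizedParams(prior=self.prior, learnable=new_learnable)
```

For example, `VerbalizedParams("The data looks periodic.", "Task: 1-D regression.")` keeps the periodicity hint in every prompt while only the second field changes across optimization steps.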
Difference between VML and prompt optimization. Both VML and prompt optimization aim to automatically produce a text prompt towards some target, but VML differs from existing prompt optimization works (e.g., [67, 37]) in a substantial way. First, VML aims to automatically discover a data pattern description that acts as the model parameters for the LLM learner, while prompt optimization seeks a generic instruction, without changing the original meaning, to elicit the best downstream question-answering performance. We qualitatively compare the difference of their learned prompts in the experiment section. Second, prompt optimization can be viewed as a building block for VML, as its techniques can be naturally adapted for the training of VML.
VML enables interpretable knowledge discovery. Because the model parameters are already in natural language, it is easy to understand the underlying pattern that leads to the prediction and the
discover novel knowledge that humans can also learn from.
VML as “the von Neumann architecture” in machine learning. Machine learning usually treats the model parameters and the data differently, similar to the Harvard architecture that stores instruction and data separately. VML stores both data and model parameters in the text prompt as tokens, which resembles the von Neumann architecture that stores instruction and data in the same memory.
We demonstrate the features and advantages of VML by revisiting some classical machine learning tasks followed by a realistic medical image classification task. In these tasks, we are given data $\{(x_n, y_n)\}_{n=1}^{N}$, and we want to find $\theta$ such that $f(x; \theta)$ best describes the mapping $x \mapsto y$. Our experiments below show in detail how VML is able to solve these tasks and find
Experiment setups. We use the instruction-tuned Llama-3 70B [47] for the LLM unless specified otherwise. The training set for each task consists of 100 data points. For all tasks, we use a batch
steps per epoch of training. To evaluate regression performance, we look at the training loss, and the model predictions in both the interpolation and extrapolation settings. As for classification, we use additional test sets consisting of 20 data points, and evaluate the training and testing accuracies. During optimization, inspired by the idea of momentum from classical machine learning optimization, we also provide the last step (i.e., one step only) of the optimization history to stabilize training.
Training logs. The results of our experiments are shown using: (a) training loss, which is computed by parsing the model output (string) and converting it into the same data type as the target value (y); we then use mean squared error for regression, and zero-one loss mean (i.e., average accuracy) for
VML (see Algorithm 1); (b) visualization of the learned model, which is also done through parsing and converting the model output; (c) the model parameter at each training step i before optimization (i.e., $\theta_{i-1}$), and the optimizer output for the updated $\theta_i$. For i > 1, the full model parameter before optimization is , but in our figures below we only show the to save space.
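The parse-then-score step described in (a) can be sketched as below. The regular expression, the choice to read the first number in the output, and the function names are assumptions made for illustration; the paper's actual parsing code is not shown in this section.

```python
import re
from typing import Optional, Sequence

# Matches the first integer or decimal number (optionally in scientific notation).
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_number(output: str) -> Optional[float]:
    """Convert the learner's string output to a float, or None if absent."""
    match = _NUMBER.search(output)
    return float(match.group()) if match else None


def mean_squared_error(outputs: Sequence[str], targets: Sequence[float]) -> float:
    """Regression loss on parsed outputs; unparsable outputs raise an error."""
    total = 0.0
    for out, y in zip(outputs, targets):
        pred = parse_number(out)
        if pred is None:
            raise ValueError(f"no number found in model output: {out!r}")
        total += (pred - y) ** 2
    return total / len(targets)


def mean_zero_one_loss(outputs: Sequence[str], targets: Sequence[int]) -> float:
    """Fraction of examples whose parsed label differs from the target."""
    mistakes = sum(1 for out, y in zip(outputs, targets)
                   if parse_number(out) != y)
    return mistakes / len(targets)
```

In this sketch an unparsable classification output simply counts as a mistake, whereas an unparsable regression output raises an error; other conventions (for example, re-querying the model) are equally possible.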
Figure 3: Training dynamics for VML-based linear regression. The model is trained for 2 epochs, each with 10 steps.
Compute. The LLM is run on a compute node using the inference engine provided by vLLM [17].
over a batch, and 1 time for requesting the newly optimized model parameters. We also evaluate the entire test set at each step, which, depending on the size of the evaluation set, requires between 20 and 100 LLM queries. Overall, the regression tasks take around 10 minutes for each epoch of training, and the classification tasks take around 16 minutes for each epoch of training. An additional 6-minute overhead arises from evaluating the grid for the background of the decision boundary.
task from R to R (see Figure 3(c) Step 1). Figure 3(a) shows that training improves the model, and that it converges. The subplots (b) and (c) show details of the model and optimization at steps 1, 3 and 15. At step 1, since the model parameters contain only the definition of the 1-D regression task, the model is randomly guessing (see the dash-dot line). The optimizer says that it notices a linear relationship between the input and the target outputs, and hence introduces a linear regression model to capture such a relationship, which results in the model being a straight line. From step 2 onward, the optimization focus switches to fitting the identified linear regression model to the data. For example, at step 3, we can see that the optimizer says it notices that the outputs of the model are generally smaller than the target, suggesting the scaling factor is too small; hence, it increases it. Similarly, at step 15, the optimizer also says it notices that the model overestimates the target; hence, it reduces the scaling factor. We can see from (b) that the resulting model closely approximates the ground truth.
and . Similarly, the model parameters are initialized by only specifying that the task is a regression task from R to R (see Figure 4(c) Step 1). Figure 4(a) shows that training is effective and
Figure 4: Training dynamics for VML-based polynomial regression. The model is trained for 2 epochs, each with 10 steps.
Figure 5: Demonstration of prior injection, and comparison between Llama-3, GPT-4o and a neural net in the setting of sinusoidal regression.
converges. Subplots (b) and (c) show details of the model and optimization at steps 1, 2 and 3. At step 1, the model randomly guesses the outputs. The optimizer says that it notices y has a larger range than x, and that they seem to have positive correlation; therefore, it updates the model to be a simple linear model. This linear model assumption leads to a jump in the training loss (see subplot
makes it realize that the linear model oversimplifies the relationship between x and y. It notices a non-linearity between x and y, and to capture this, it uses a quadratic model. This results in a better model and leads to a large decrease in the training loss. At step 3, the optimizer switches from model class selection to fitting the quadratic model. The resulting model closely fits the ground truth.
We generate from a sine function with Gaussian noise, i.e., , where and . Fitting a sine function is known to be difficult for neural nets in
Figure 6: Linearly separable two blobs classification based on VML. (b) plots the decision boundary of model with
shows that when the model parameters contain only the definition of 1-D regression, training results in a linear model (see (c; right)). We can add a prior to the model parameters by simply saying that the data looks like samples generated from a periodic function, which results in a very good approximation and it extrapolates
(see (b,c; mid)), indicating that the capability of VML depends on the capability of the underlying LLM. However, we note that the effectiveness of VML improves along with the capability of the LLM.
We generate a linearly separable dataset from two blobs on a 2-D plane. The model parameters are initialized by only specifying that the task is binary classification on a 2-D plane (see Figure 6(c) Step 1). Subplot (a) shows that training is effective and that it converges. At step 1, the optimizer says its inspection of the current batch of data shows the pattern that data points with x > 0 belong to class 2, and data points with x < 0 belong to class 1; hence it updates the model to have a linear decision boundary at x = 0, which happens to be perfect. However, Figure 6(a) shows that the training loss does not immediately converge. We can investigate the cause and “debug” the optimizer by looking at what optimizer
it wants to further improve the model and utilize the new information from the current batch. Guided by this reasoning, the model becomes a very deep decision tree, and the decision boundary has a reasonable margin towards the data (see Figure 6(b, c; right)).
We generate a non-linearly separable dataset by creating data points on two concentric circles for the two classes. Besides the definition of binary classification on a 2-D plane, we also add a sentence to the model parameters to encode our inductive bias that the decision boundary is a circle (see Figure 7(c) Step 1). At step 1, the optimizer utilizes the prior information, and updates the model to have a circular decision boundary. For the rest of the training steps, the optimizer mainly tries to find a good fit for the radius and the center of the decision boundary. At step 41, the optimizer says the current model seems to be a good fit for the data, and no changes are needed; hence, it keeps the same parameters for the model. Without the prior, VML can also learn a good model, but the performance shows large variance at the beginning
Figure 7: Non-linearly separable two circles classification with a prior in the model parameters. (a; dashed) and (c; bottom right) also show results without the prior.
of training (see Figure 7(a; dashed)) due to the model class selection process similar to Figure 3(a). Figure 7(c; bottom right) shows the resulting model without the prior, which is a decision tree.
To differentiate VML from prompt optimization, we qualitatively compare VML to a popular prompt optimization method called Automatic Prompt Engineer (APE) [67] on two tasks.
Linear regression as in Section 4.1. Figure 8(a) shows that the result from APE is vague and general. Such a description can easily be derived by humans through visual inspection of the data, and it does not learn deeper insights from the data, whereas VML is able to learn useful new information that is difficult to derive by visual inspection of the data. We can see that VML is doing pattern recognition, which is different from naive prompt optimization.
Text classification. Adapted from Google BIG-bench [4], the task is to classify whether a name is more likely to be associated with female or male. Figure 8(b) shows that APE does return a correct description of the task, but it is, once again, very general. Conversely, VML is able to learn more detailed knowledge about the data pattern, which cannot be done easily through visual inspection.
To demonstrate the capability of VML beyond simple machine learning problems, we include an experiment on image classification. We use GPT-4o, which supports visual inputs, to take into account both image and text data. The task is to classify whether an input X-ray image has indications of pneumonia or not (see Figure 9(b) for image examples). Due to the cost of requesting GPT-4o, we create a subset of the dataset PneumoniaMNIST [58]. Our dataset consists of 100 training data points and 100 test data points (half pneumonia and half normal for both sets). Models are trained for 5 epochs. We try out two different model parameter initializations, one
Figure 9: Tiny-PneumoniaMNIST image classification for models with and without prior at initialization.
with prior and one without. We encode the inductive bias by simply adding a sentence as the prior, which states that the input is an X-ray image for identifying pneumonia, along with the definition of binary image classification (see Figure 9(c)). The test accuracy in (a) shows that both models are able to improve their performance on the task as the training epoch increases, and the model initialized with the prior also outperforms the model without (in terms of both testing accuracy and training convergence). Additionally, by inspecting the parameters of the model (see (d)), we can
associated with features of pneumonia (such as “acute infection”, “pneumatocele formation”), while the model parameters for the learner without any prior mainly use generic visual knowledge associated with features of the lung (such as “visible opacities”, “uniform texture”). This observation validates the effectiveness of using natural language to describe and encode inductive bias. More importantly, our experiment demonstrates the usefulness of learning in VML (i.e., the generalization performance can be improved over time), which is also one of the key differences from existing prompt engineering methods. Additionally, the interpretable nature of the learned model parameters in VML is crucial for applications in the medical domain. The learned models can be validated by medical professionals, and their predictions are grounded in their verbalized reasoning.
Table 1: Comparison between VML and ICL on previous applications (without adding any prior information).
Quantitative comparison to in-context learning. Since
pare VML to ICL in all previous applications. Results
and classification (Cls) are mean squared error (MSE)
and test accuracy, respectively. We abbreviate linear regression as Reg-L, polynomial regression as Reg-P, two blob classification as Cls-TB, two circle classification as Cls-TC and medical image classification as Cls-MI. The results show that VML consistently outperforms ICL in all scenarios.
Figure 10: Training loss for ablation study. For each configuration, we show 5 individual runs (thin) and their mean (thick).
in the linear regression setting (Section 4.1).
Direct vs. indirect parameterization. As discussed in Section 3.4, we have a direct and an indirect way to parameterize the optimizer. We compare both parameterizations using the linear regression setting in Section 4.1. Figure 10(b) shows that the direct parameterization outperforms the indirect one. The direct parameterization is also more efficient and requires fewer LLM calls. Detailed experimental settings and discussions are given in Appendix A.3.
Figure 11: VML is able to learn to reason and solve symbolically generated GSM8K [32] questions with Llama-3.1-8B.
Figure 12: Binary digit pattern discovery of 4-integer vectors. No prior is injected to the model parameter in this experiment.
Recent work [32] shows that if we modify the original GSM8K [7] question by changing only the variable values (e.g., Figure 11(a)), the accuracy of many LLMs on the modified dataset will decline, which might be due to data contamination during pretraining. Our experiments show that VML can reduce such a performance variation and enable robust mathematical reasoning without changing the internal weights of LLMs. Specifically, we randomly generate a training set and a test set, both of size 100 (without overlap), using the template in Figure 11(a). If we directly evaluate the test set using Llama-3.1-8B, the average accuracy over 5 runs is around 80%. We use VML to learn a set of instructions for this task; the initial one is given in Figure 11(c). We use a batch size of 10 and train for 10 steps. Figure 11(b) shows that, on average over 5 runs, the test performance increases with the number of training steps, and VML enables the model to achieve 98% accuracy on the test
We can see that after step 1, the new instructions already recover the correct mathematical reasoning for the task, but the test accuracy is only around 91%. At step 3, the optimizer realizes that the error is mostly due to inaccurate calculations rather than the correctness of the instructions for the model. Hence, the new instructions for the model include an emphasis on calculation verification, which brings the test performance up to 96%. At step 6, the optimizer says it notices that many of the model’s mistakes are still related to incorrect and incomplete calculations; therefore, it breaks down the instructions into a more detailed list to allow easier reasoning and checking. Therefore, VML can enable LLMs to improve their reasoning ability at test-time by themselves without changing the internal weights.
To further demonstrate the interpretability of VML, we create a binary classification task on vectors of 4 digits. Class 0 contains vectors that only have digit ‘0’ in the first position, and Class 1 contains vectors that only have digit ‘0’ in the last position (see Figure 12(b)). Our dataset consists of 100 training data points and 20 test data points (half for each class). Models are trained for 5 epochs (i.e., 50 steps with batch size 10). Figure 12(a) shows that both the training and test accuracy improve with the number of steps; hence, learning is effective. The model is initialized with the definition of the task. During step 1, the optimizer says it notices that the first element of each input is often ‘0’ when the ground truth label is ‘0’, and decides to use a rule-based approach (see (c)). The resulting model description is half correct, as it captures the pattern that ‘if the first element is 0, predicts 0’. After a few more steps, the optimizer is able to learn the correct description: ‘If the first element is 0, predicts 0. Otherwise, if the last element is 0, predict 1.’ Compared to the regression and 2D plane classification results, the learned model here is more interpretable than a learned neural network. Also, without any prior information, one will normally choose a universal approximator such as a neural network to solve this task, which will perform equally well but is certainly not as interpretable. We also evaluate the performance of in-context learning (ICL) for this task as a baseline. Our result shows that VML is able to achieve 100% test accuracy with an interpretable description of the pattern, while ICL can only achieve 87.5% and does not explicitly output a pattern description.
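To make the task and the learned description concrete, the sketch below generates data of the kind described and transcribes the final verbalized rule into code. The data generator is an assumption: the text only states where the digit 0 appears, so drawing the remaining digits uniformly from 1 to 9 is a choice made here, and the names `make_example`, `make_dataset` and `learned_rule` are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def make_example(label: int, rng: random.Random) -> List[int]:
    """Class 0 has its only '0' first; class 1 has its only '0' last."""
    others = [rng.randint(1, 9) for _ in range(3)]  # assumed: non-zero digits
    return [0] + others if label == 0 else others + [0]


def make_dataset(n: int, seed: int = 0) -> List[Tuple[List[int], int]]:
    """Balanced dataset: labels alternate between 0 and 1."""
    rng = random.Random(seed)
    return [(make_example(i % 2, rng), i % 2) for i in range(n)]


def learned_rule(vec: List[int]) -> Optional[int]:
    """Transcription of the final verbalized model quoted in the text."""
    if vec[0] == 0:
        return 0
    if vec[-1] == 0:
        return 1
    return None  # the verbalized rule does not cover this case
```

Because the learned parameters are a two-line rule, checking them against data reduces to a direct comparison such as `all(learned_rule(v) == y for v, y in make_dataset(20))`, which is the kind of verification a weight matrix does not allow.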
on regression and classification tasks. The experiments show that VML can effectively perform these classical machine learning tasks, validating the potential of language models as function approximators. Despite the empirical effectiveness, there are a few limitations that remain to be addressed. First, training in VML still suffers from a relatively large variance. This is partially due to the stochasticity of LLM inference, as well as the prompt design of the optimizer. Second, the output numerical error in LLMs results in inevitable fitting error. Concretely, even if the LLM correctly understands the underlying symbolic expression, there is still an output numerical error when performing inference on specific input values. This also suggests an intrinsic difficulty within LLMs in properly understanding numbers (see [39, 63]). Third, the input data dimensionality and batch size are largely limited by the size of the context window in LLMs.
One future direction is to study various aspects of VML using insights and concepts from classical machine learning. Some interesting questions include: Can we find a better design for the optimizer so that the training is more robust and efficient? How does the optimization landscape in VML differ from that of classical ML, and what does it look like? Another interesting direction is to investigate the learning dynamics of VML, and compare them with how humans learn. Since humans also have a language model in mind, the same experiments in the paper can be conducted on humans through messaging software.
The connection between LLMs and computers was discussed in our blog¹. The VML framework is naturally motivated by the idea of LLMs acting as a modern computer, since we view the verbalized model parameters as a way to “program” the LLM. We then connect VML to the von Neumann architecture in the sense that both data and program instruction are put in the text prompt in VML.
WL was supported by the German Research Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP XX, project number: 276693517. This work was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC number 2064/1 – Project number 390727645. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen
for project 448588364 of the Emmy Noether Programme. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Tim Z. Xiao.
[1] AlphaProof and AlphaGeometry teams. Ai achieves silver-medal standard solving international mathematical olympiad problems. DeepMind blog, 2024. 31
[2] Jacob Andreas, Dan Klein, and Sergey Levine. Learning with latent language. In NAACL, 2018. 2
[3] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016. 1
[4] BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. 9
[5] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024. 2
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020. 31
[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 11
[8] Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement learning. In EMNLP, 2022. 2
[9] Samuel J. Gershman and David M. Blei. A tutorial on bayesian nonparametric models. Journal of Mathematical Psychology, 2011. 29
[10] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In NeurIPS, 2021. 30, 31
[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997. 5
[12] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023. 2
[13] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016. 3
[14] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017. 5
[15] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009. 5
[16] Andrei N. Kolmogorov. Three approaches to the quantitative definition of information. International Journal of Computer Mathematics, 1968. 29
[17] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In SIGOPS, 2023. 6
[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 5
[19] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. In NeurIPS, 2023. 2
[20] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022. 2
[32] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024. 11
[33] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In ECCV, 2022. 2
[34] Tuomas Oikarinen, Subhro Das, Lam M Nguyen, and Tsui-Wei Weng. Label-free concept bottleneck models. In ICLR, 2023. 2
[35] Peter Orbanz and Yee Whye Teh. Bayesian nonparametric models. Encyclopedia of machine learning, 2010. 29
[36] Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. In ICCV, 2023. 2
[37] Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. In EMNLP, 2023. 2, 5, 19
[38] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023. 2
[39] Jing Qian, Hong Wang, Zekun Li, Shiyang Li, and Xifeng Yan. Limitations of language models in arithmetic and symbolic induction. arXiv preprint arXiv:2208.05051, 2022. 12
[40] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019. 31
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 2
[42] James Requeima, John Bronskill, Dami Choi, Richard E Turner, and David Duvenaud. Llm processes: Numerical predictive distributions conditioned on natural language. arXiv preprint arXiv:2405.12856, 2024. 30
[43] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 30
[44] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024. 2
[45] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In ICCV, 2023. 2
[46] Alessandro Sordoni, Eric Yuan, Marc-Alexandre Côté, Matheus Pereira, Adam Trischler, Ziang Xiao, Arian Hosseini, Friederike Niedtner, and Nicolas Le Roux. Joint prompt optimization of stacked llms using variational inference. In NeurIPS, 2023. 2
[47] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3, 5
[48] P. Vitanyi and Ming Li. Minimum description length induction, bayesianism, and kolmogorov complexity. In ISIT, 1998. doi: 10.1109/ISIT.1998.708951. 29
[49] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. 2
[50] Larry Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006. 29
[51] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. 2
[52] Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. In NeurIPS, 2023. 2
[53] Jason Weston and Sainbayar Sukhbaatar. System 2 attention (is something you might need too). arXiv preprint arXiv:2311.11829, 2023. 2
[54] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023. 2
[55] Yaqi Xie, Chen Yu, Tongyao Zhu, Jinbin Bai, Ze Gong, and Harold Soh. Translating natural language to planning goals with large-language models. arXiv preprint arXiv:2302.05128, 2023. 2
[56] An Yan, Yu Wang, Yiwu Zhong, Chengyu Dong, Zexue He, Yujie Lu, William Yang Wang, Jingbo Shang, and Julian McAuley. Learning concise and descriptive attributes for visual recognition. In ICCV, 2023. 2
[57] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In ICLR, 2024. 2, 4
[58] Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, and Bingbing Ni. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 2023. 9
[59] Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models. In NeurIPS, 2023. 31
[60] Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In CVPR, 2023. 2
[61] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023. 2
[62] Yao Yao, Zuchao Li, and Hai Zhao. Beyond chain-of-thought, effective graph-of-thought reasoning in large language models. arXiv preprint arXiv:2305.16582, 2023. 2
[63] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, and Songfang Huang. How well do large language models perform in arithmetic tasks? arXiv preprint arXiv:2304.02015, 2023. 12
[64] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024. 5, 19, 30
[65] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023. 2
[66] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022. 2
[67] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022. 2, 3, 5, 9
Here we provide additional details for the experiments in Section 4.8.
In-context learning (ICL) is a popular method for adapting LLMs to downstream tasks. Here, we compare the performance of VML and ICL on various tasks from previous sections. For all tasks, we provide the entire training set as in-context examples, and query each test data point independently. The resulting predictions for regression and 2D classification are plotted in Figure 13. The full comparison between VML and ICL is shown in Table 2. We can see that VML outperforms ICL in regression and medical image classification, and performs on par with ICL in the simpler classification tasks, e.g., two blobs and two circles. Within our framework, ICL can be understood as a nonparametric method, while VML is a parametric one (see Appendix G.3 for more discussion).
Figure 13: Predictions of in-context learning (ICL) for the same regression and classification tasks with Llama-3 70B.
Table 2: Test performance for in-context learning (ICL) and verbalized machine learning (VML) on various tasks from previous sections (without adding prior information). The ICL results are chosen from the best across 5 runs. The metrics used for regression (Reg) and classification (Cls) are mean squared error (MSE ↓) and test accuracy (↑), respectively.
To verify whether the performance of VML scales with the capability of LLMs, we compare three Llama-3.1 models of different sizes, i.e., 8B, 70B, and 405B, in the linear regression setting. Figure 14 shows the training loss of 5 individual runs (thin) and their mean (thick) for each LLM. Note that due to the high-variance nature of using LLMs for optimization, we select the 5 best runs out of 10 runs for this comparison. We see that more powerful LLMs (e.g., 405B) learn faster and achieve lower training loss.
Figure 14: Llama-3.1 LLMs scale versus VML training performance in linear regression setting. 5 individual runs (thin) and mean (thick) for each LLM.
There are different ways to implement the optimization step in VML. We choose to directly update the model parameters in a single LLM call by providing all the necessary information, i.e., $f_{\mathrm{opt}}$ in Algorithm 1. If we choose a lower abstraction level, we can decompose the direct single-step optimization into an indirect multi-step optimization. Algorithm 2 illustrates how $f_{\mathrm{opt}}$ can be decomposed into four consecutive functions, which resemble the operations of computation graphs in most numerical machine learning frameworks. Specifically, we calculate the following step by step: (1) the quality of the predictions (i.e., evaluate the loss function); (2) the ‘gradient’ of the loss w.r.t. the predictions; (3) the ‘gradient’ of the loss w.r.t. the parameters; (4) the update of the current parameters $\theta_{i-1}$ to $\theta_i$ using the ‘gradient’ from step (3). The ‘gradients’ here are known as ‘textual gradients’ in the prompt optimization literature [37, 64], and are essentially text-based feedback from LLMs.
We compare the two approaches in the linear regression setting using Llama-3.1 70B. Figure 15 shows, for both the direct and indirect optimization, the training loss of 5 individual runs (thin) and their mean (thick). We can see that the indirect method performs slightly worse than the direct method. A possible reason is that the indirect method requires 3 more prompt templates, which are harder to design than a single one, and carries a higher risk of losing information along the pipeline.
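The structural difference between the two approaches can be sketched as one LLM call versus a chain of four. The sketch below is illustrative only: the `LLM` callable interface, the prompt wording, and the `CountingStub` stand-in are assumptions made here and are not the prompt templates used in the paper.

```python
from typing import Callable

# Hypothetical text-in, text-out interface standing in for an LLM.
LLM = Callable[[str], str]


def direct_step(llm: LLM, theta: str, batch: str) -> str:
    """Direct optimization: a single call sees everything at once."""
    return llm(
        f"Current model: {theta}\n"
        f"Batch (inputs, predictions, targets): {batch}\n"
        "Return the updated model description."
    )


def indirect_step(llm: LLM, theta: str, batch: str) -> str:
    """Indirect optimization: four consecutive calls."""
    # (1) evaluate the quality of the predictions
    loss = llm(f"Evaluate the predictions against the targets.\n{batch}")
    # (2) textual 'gradient' w.r.t. the predictions
    grad_pred = llm(f"How should the predictions change?\n{loss}")
    # (3) textual 'gradient' w.r.t. the parameters
    grad_theta = llm(
        f"Model: {theta}\nFeedback on predictions: {grad_pred}\n"
        "How should the model description change?"
    )
    # (4) apply the update to the current parameters
    return llm(f"Model: {theta}\nFeedback: {grad_theta}\nRewrite the model.")


class CountingStub:
    """Stand-in LLM that only counts calls (no real model involved)."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"stub response {self.calls}"
```

In the indirect variant, each intermediate string is the only channel to the next stage, so anything omitted by one call is unavailable to the later ones. This is one way to read the information-loss risk mentioned above.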
Figure 15: Training loss of direct and indirect optimization in linear regression setting using Llama-3.1 70B. The lines show 5 individual runs (thin) and mean (thick) for each approach.
Figure 16: Training dynamics for two different optimization settings in the polynomial regression setting. One has access to the accurate loss computation, and the other does not.
The VML algorithm in Algorithm 1 specifies that the arguments for $f_{\mathrm{opt}}$ consist of the inputs $x$, the predictions $\hat{y}$, the targets $y$, the current model parameter $\theta_{i-1}$ and the optimizer configurations $\psi$. Hence, there is no explicit definition of the loss function for the optimizer (see Figure 2(right) for an example of the verbalized loss function). It is up to the optimizer itself to evaluate the difference between the prediction $\hat{y}$ and the target $y$. We are interested in whether having access to the real training loss (defined and computed for logging purposes), mean squared error in this case, can help the optimizer to better navigate the training trajectory.
The orange line in Figure 16(c) shows that having such accurate loss feedback might not help, and might even decrease the performance in this scenario. One possible explanation is that the single loss value itself does not contain much information. Moreover, as the exact form of the loss function can be fed to LLM easily, the LLM might spend additional effort estimating the exact form of the loss function, which makes convergence even more difficult. It makes intuitive sense that a verbalized loss function (i.e., using natural language to explain the target of the loss function) works better in the VML framework. For example, knowing how each prediction contributes to the loss value can be more informative than a single overall loss value, since the model might be doing well for some data but not for others, and we only want to improve the model for the points with bad predictions.
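A small numerical illustration of this last point: the sketch below contrasts a single mean squared error with the per-point squared errors it is averaged from. The function name `per_point_feedback` and the toy numbers are assumptions chosen for illustration and do not come from the experiments.

```python
from typing import List, Sequence, Tuple


def per_point_feedback(
    preds: Sequence[float], targets: Sequence[float]
) -> Tuple[List[float], float]:
    """Return the per-point squared errors and their mean (the MSE)."""
    errors = [(p - t) ** 2 for p, t in zip(preds, targets)]
    mse = sum(errors) / len(errors)
    return errors, mse


# Toy batch: three predictions are exact, one is far off.
preds = [1.0, 2.0, 3.0, 10.0]
targets = [1.0, 2.0, 3.0, 4.0]

errors, mse = per_point_feedback(preds, targets)
worst = max(range(len(errors)), key=errors.__getitem__)

print(errors)  # [0.0, 0.0, 0.0, 36.0]
print(mse)     # 9.0
print(worst)   # 3
```

The scalar 9.0 alone does not reveal that three of the four predictions are already exact, whereas the per-point view localizes the problem to the last data point.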
C NUMERICAL ERROR OF LLMS IN REPRESENTING SYMBOLIC FUNCTIONS
Figure 17: Function evaluations and numerical error in Llama-3 70B.
Figure 18: Function evaluations and numerical error in GPT-4o.
LLMs are designed to do language modeling, rather than exact calculations. Hence, their performance on evaluating functions can be unreliable and may result in errors. Figure 17 shows that Llama-3 evaluates the given linear and polynomial functions reliably, as the mean is quite accurate. The variance over 10 runs is also small, except for one or two points. However, for a more complex function such as sin(x), Llama-3 returns small error only within a limited range of x. Both the error and the variance are large outside of this range. This explains the non-smoothness of the function in Figure 5(b; right), which has sin(x + 1.0) in the learned model parameters.
By switching to the more powerful model, GPT-4o, we can see from Figure 18 that both the error and the variance decrease. In particular, for sin(x), GPT-4o returns smaller error over a larger range of x. This implies that as the capability of LLMs improves, their performance in evaluating more complex functions also improves.
Nevertheless, this is currently still a limitation for VML if the optimizer chooses to use complex mathematical functions as the model parameter. If the evaluation of the function has an error, then during training, the optimizer will update the model parameters based on a noisy signal. This can lead to large variance in training and slow convergence. Future work should look into methods for minimizing the numerical error in LLM function evaluation.
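The measurement behind such plots (mean error and variance over repeated evaluations at each input) can be sketched as follows. No LLM is called here: `noisy_sin` is a purely hypothetical stand-in whose noise grows with |x|, and both the noise model and the names `noisy_sin` and `error_profile` are assumptions of this sketch rather than a description of how any LLM actually behaves.

```python
import math
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def noisy_sin(x: float, rng: random.Random) -> float:
    """Hypothetical stand-in for an LLM asked to evaluate sin(x)."""
    return math.sin(x) + rng.gauss(0.0, 0.05 * abs(x))


def error_profile(
    evaluate: Callable[[float, random.Random], float],
    xs: Sequence[float],
    n_runs: int = 10,
    seed: int = 0,
) -> List[Tuple[float, float, float]]:
    """For each x, return (x, |mean output - sin(x)|, output variance)."""
    rng = random.Random(seed)
    profile = []
    for x in xs:
        outputs = [evaluate(x, rng) for _ in range(n_runs)]
        mean_error = abs(statistics.fmean(outputs) - math.sin(x))
        variance = statistics.pvariance(outputs)
        profile.append((x, mean_error, variance))
    return profile


for x, err, var in error_profile(noisy_sin, [0.0, 1.0, 5.0]):
    print(f"x={x:.1f}  mean error={err:.4f}  variance={var:.4f}")
```

Replacing `noisy_sin` with a real model call would give the same two quantities per input: the gap between the mean output and the true value, and the spread across runs.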
After applying the sine function to the input, you always add 2 to the resulting value. This shifts the sine wave vertically by 2 units.
Figure 19: Function evaluations based on the natural language description of the corresponding symbolic sine function.
Figure 19 shows that if we use natural language to describe the symbolic sine function (see subfigure (a)), GPT-4o is able to produce more accurate evaluations than when using the symbolic function (see (c)). The accuracy of Llama-3 70B also increases, even though it still underperforms GPT-4o (see (b)). This is likely because Llama-3 is less capable at instruction following than GPT-4o. This observation implies that in VML, we might want to instruct the optimizer to avoid using complex symbolic functions in the update and to prefer the natural language description of the function.
In this section, we supplement the experiments on Llama-3 70B with a Python interpreter. Although LLMs are able to perform tasks on numerical data, incorporating a Python interpreter further improves LLMs' ability to deal with numerical values. Specifically, we use the open-interpreter² library to add a Python interpreter to Llama-3 70B, such that the LLM can use Python programs to evaluate symbolic functions or perform numerical operations. We follow the same experimental settings as in Section 4.3 (sinusoidal regression of y = sin(x) + 2). The training data is sampled only from a limited range of x, with additive Gaussian noise. The in-domain testing data is sampled from the same range, while the out-of-domain testing data is sampled from outside it.
The results are given in Appendix D. We can observe that with the Python interpreter, Llama-3 70B can effectively learn periodic functions, while in the original experiment (i.e., Figure 5(b)), the same LLM is unable to approximate a periodic function even with a prior. The results show that tool use can further improve the learnability of VML. Example logs for inference with the learned model are shown below.
Table 3: Evaluation (using mean squared error ↓) on sinusoidal regression as in Figure 5(b) for three different models: (1) neural networks, (2) Llama3 with prior, and (3) Llama3 with prior and code interpreter.
E CONNECTION BETWEEN PREDICTION VARIANCE AND MODEL PARAMETERS IN VML
(a) shows that if we only provide the information that the task is a regression task and do not specify the model at all, the LLM tends to predict a linear function (slope ≈ 1) with increasing variance as x moves away from 0. (b) shows that if we specify that there is a linear relationship between inputs and outputs, the LLM will predict a linear function with a similar slope to (a) but with smaller variance. (c) shows that if we specify the explicit form of the linear function, the slope will still be around 1, but the variance is larger when x > 1. (d, e, f) show that by providing a range for the values of the unknown variables, the LLM tends to use the mid-point of the range for the values, and a smaller range does correspond to a smaller variance in prediction.
(b) “The updated pattern definitions utilize a linear regression framework characterized by a slope of 3.34 and an intercept of 3.28. The revised pattern equations are expressed as:
where we can easily sample multiple model parameters and compute their probabilities with logits. Specifically, we have that . Using this idea, it is actually quite easy to obtain the ensembled output that is weighted by the posterior distribution.
Incremental updating. Alternatively, we can choose to update the model parameters in an incremental fashion without removing the previous model parameters completely. Suppose the optimizer LLM generates the new model parameters . Then the model parameters at step t + 1 are
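One way such a probability-weighted ensemble could be computed is sketched below. It rests on two assumptions that are not taken from the text: that the log-probability of a sampled parameter text is the sum of its token log-probabilities (as in any autoregressive language model), and that the weights are obtained by normalizing these probabilities across the sampled candidates. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Log-probability of a text as the sum of its token log-probs."""
    return sum(token_log_probs)


def posterior_weights(log_probs: Sequence[float]) -> List[float]:
    """Normalize candidate log-probs into weights (stable softmax)."""
    shift = max(log_probs)
    unnormalized = [math.exp(lp - shift) for lp in log_probs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def ensemble_prediction(
    predictions: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted average of the predictions of each sampled model."""
    return sum(w * p for w, p in zip(weights, predictions))


# Two hypothetical sampled parameter texts with token log-probs,
# and the prediction each resulting model makes for one input.
log_probs = [sequence_log_prob([-1.0, -1.0]), sequence_log_prob([-2.0])]
weights = posterior_weights(log_probs)
print(ensemble_prediction([1.0, 3.0], weights))
```

Subtracting the maximum log-probability before exponentiating keeps the computation stable for long texts, whose raw probabilities would otherwise underflow.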
We are interested in how Occam’s razor can be applied in VML. One natural way of doing so is to constrain the model parameters to be a small and fixed length. This essentially is
We can see that as long as we constrain the text token length of the model parameters to be small, the learner will perform an automatic model simplification, as it will try to discover the data pattern with concise and simple text. There are many more ways to implement Occam’s razor in VML. More interestingly, it is also possible to incorporate a structural constraint on the model parameters. For example, it can be causal knowledge (e.g., a text representation of a causal graph), logic formulas or decision trees. Our work opens up many more possibilities for Occam’s razor in VML, and rethinking the form of Occam’s razor in VML is crucial for unlocking the strong interpretability and controllability of inductive biases.
Nonparametric methods get around the problem of model selection by fitting a single model that can adapt its complexity to the data [50, 35, 9]. These methods allow the model complexity to grow with the number of observed data. This is different from parametric models, which have a fixed number of parameters. In VML, as shown in Section 4.2, the model complexity is also flexible and adapts to the data during training. Similarly, the concept of in-context learning (ICL) can also be understood as a nonparametric method through the lens of LLMs as function approximators. ICL denotes the method of using LLMs to solve new tasks by only providing the task demonstrations or examples in the prompt with natural language. Given a new data point, an LLM predicts its output using information in the provided demonstrations. From the perspective of VML, ICL in an LLM essentially defines a nonparametric model implicitly using the demonstrated examples in the natural language space.
Verbalized machine learning aims to provide a framework for LLMs to deal with machine learning tasks, with the ability to fully interpret the learned knowledge with natural language. We believe this framework will become increasingly powerful as LLMs get more powerful. We have already observed the performance improvement of VML by switching from Llama-3 to GPT-4o.
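A length constraint of this kind could be enforced with a simple check on candidate model descriptions, as in the sketch below. This is an illustration under stated assumptions: whitespace-separated words are used as a rough proxy for LLM tokens, and choosing the lowest-loss candidate within the budget is a selection mechanism introduced here, not one described in the paper.

```python
from typing import List, Optional, Tuple


def within_budget(theta: str, max_tokens: int) -> bool:
    """Check a description against a length budget.

    Whitespace-separated words are only a rough proxy for the
    tokens of a real LLM tokenizer (an assumption of this sketch).
    """
    return len(theta.split()) <= max_tokens


def select_candidate(
    candidates: List[Tuple[str, float]], max_tokens: int
) -> Optional[Tuple[str, float]]:
    """Pick the lowest-loss (description, loss) pair within budget."""
    feasible = [c for c in candidates if within_budget(c[0], max_tokens)]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c[1])


# Hypothetical candidate descriptions with hypothetical training losses.
candidates = [
    ("Output 3 times the input plus 4.", 0.12),
    ("Output 3 times the input plus 4, then adjust slightly for "
     "inputs near zero depending on several special cases.", 0.10),
]
print(select_candidate(candidates, max_tokens=10))
```

With a budget of 10 words, only the first description is feasible, so the slightly better-fitting but longer candidate is rejected in favor of the simpler one.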
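The nonparametric reading of ICL can be illustrated by how an ICL prompt is assembled: the "model" is the list of demonstrations itself, so its size grows with the data. The prompt format and the name `build_icl_prompt` below are assumptions made for illustration and do not reproduce the prompts used in the experiments.

```python
from typing import List, Sequence, Tuple


def build_icl_prompt(
    demos: Sequence[Tuple[str, str]], query: str
) -> str:
    """Concatenate all demonstrations, then append the query."""
    blocks: List[str] = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Hypothetical demonstrations for a toy regression task.
small = build_icl_prompt([("1.0", "7.0")], "3.0")
large = build_icl_prompt([("1.0", "7.0"), ("2.0", "10.0")], "3.0")

print(large)
print(len(small), len(large))
```

Each added demonstration lengthens the prompt, whereas a VML model description has a length that is chosen by the optimizer and need not grow with the number of training points.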
Figure 22: Prompt templates of VML for the learner and optimizer for the linear regression (Llama-3-70B without prior).
Figure 23: Prompt templates of VML for the learner and optimizer for the polynomial regression (Llama-3-70B without prior).
Figure 24: Prompt templates of VML for the learner and optimizer for the sinusoidal regression (GPT-4o with prior).
Figure 25: Prompt templates of VML for the learner and optimizer for the two blobs classification (Llama-3-70B without prior).
Figure 26: Prompt templates of VML for the learner and optimizer for the two circles classification (Llama-3-70B with prior).
Figure 27: Prompt templates of VML for the learner and optimizer for the text classification (Llama-3-70B without prior).