MIT’s introduction of the LAB (Large-scale Alignment for chatBots) method marks a significant milestone in the evolution of LLM training. By offering a systematic, cost-effective approach to generating synthetic data and incorporating task-specific knowledge and skills, LAB has the potential to transform the landscape of enterprise AI, making sophisticated chatbots and other AI systems more accessible and effective than ever before.
The traditional method of training LLMs relies heavily on vast quantities of raw text data, often scraped from the internet, which is then supplemented with task-specific information during the fine-tuning stage. This process, while effective to a degree, is fraught with difficulties, primarily due to the scarcity of high-quality instruction data. Producing this data manually is both time-consuming and expensive, and while synthetic data offers a cost-effective alternative, it frequently lacks the necessary variety and depth.
LAB Method
Enter MIT’s LAB method, a strategic approach to generating synthetic data tailored specifically to the tasks intended for the chatbot. This method significantly reduces the time and financial investment traditionally required for LLM training, without compromising on the quality or breadth of instruction data. At the heart of LAB is a taxonomy that provides a structured framework for identifying the knowledge and skills a chatbot needs, allowing for the generation of high-quality instructions that facilitate the seamless incorporation of new abilities into the LLM.
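The taxonomy is easiest to picture as a tree whose leaves each hold a few seed examples of the desired behavior. Below is a minimal sketch of one way to represent such a hierarchy; the three top-level branches follow the article’s description, but the `TaxonomyNode` class and the specific leaf names are illustrative assumptions, not the method’s actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    """One node in the knowledge/skills taxonomy (illustrative sketch).

    Leaf nodes hold a handful of human-written seed examples that a
    teacher model later expands into synthetic instruction data.
    """
    name: str
    children: list["TaxonomyNode"] = field(default_factory=list)
    seed_examples: list[dict] = field(default_factory=list)  # {"question": ..., "answer": ...}

    def leaves(self):
        """Yield every leaf node; each leaf is one target for data generation."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()

# Hypothetical taxonomy slice for the quarterly-earnings email example below.
taxonomy = TaxonomyNode("root", children=[
    TaxonomyNode("knowledge", children=[
        TaxonomyNode("accounting/quarterly_earnings", seed_examples=[
            {"question": "What is EBITDA?",
             "answer": "Earnings before interest, taxes, depreciation, and amortization."},
        ]),
    ]),
    TaxonomyNode("foundational_skills", children=[
        TaxonomyNode("math/percentage_change"),
    ]),
    TaxonomyNode("compositional_skills", children=[
        TaxonomyNode("writing/summarization_email"),
    ]),
])
```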
This process involves a “teacher model” that creates pairs of questions and answers, each meticulously crafted to address specific tasks. For instance, in preparing a chatbot to draft an email summarizing a company’s quarterly earnings, the taxonomy categorizes the necessary instruction data into knowledge, foundational skills, and compositional skills. This structured approach ensures comprehensive coverage of all requisite areas, including accounting knowledge, mathematical abilities, and writing and reasoning skills.
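A sketch of that generation step, assuming the taxonomy structure above, might look like the following. The prompt wording and the `teacher_llm` callable are hypothetical placeholders for whatever model and serving API play the teacher role; in practice the generated pairs would also need validation and quality filtering before being used for training.

```python
import json

PROMPT_TEMPLATE = """You are a teacher model creating training data.
Task area: {task}
Here are seed examples of the desired style:
{seeds}
Write {n} new, diverse question-and-answer pairs for this task area.
Return them as a JSON list of {{"question": ..., "answer": ...}} objects."""

def generate_qa_pairs(teacher_llm, leaf, n=5):
    """Ask the teacher model for new Q/A pairs grounded in one taxonomy leaf.

    `teacher_llm` is any callable mapping a prompt string to a completion
    string (a stand-in for a real model-serving API).
    """
    prompt = PROMPT_TEMPLATE.format(
        task=leaf.name,
        seeds=json.dumps(leaf.seed_examples, indent=2),
        n=n,
    )
    completion = teacher_llm(prompt)
    return json.loads(completion)  # validate and filter before use in practice

# Walking every leaf of the taxonomy yields the full synthetic dataset:
# dataset = [pair for leaf in taxonomy.leaves()
#            for pair in generate_qa_pairs(teacher_llm, leaf, n=100)]
```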
One of the most innovative aspects of LAB is its phased-training protocol, which segments training into stages, starting with simple instructions before progressing to more complex, narrative-like ones. This graduated approach mimics human learning, allowing the LLM to build progressively on its existing knowledge and skills. Empirical evidence from MIT’s research indicates that the sequence in which knowledge and skills are introduced plays a crucial role in the model’s ability to assimilate new information effectively.
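In code, the phased protocol amounts to an ordered curriculum over the taxonomy’s branches. The sketch below assumes the knowledge-then-skills ordering described above; the `train_fn` interface and the replay fraction are placeholder assumptions, not the method’s exact recipe.

```python
def phased_training(model, dataset_by_branch, train_fn):
    """Train in stages of increasing complexity, replaying earlier data.

    `dataset_by_branch` maps each taxonomy branch to its synthetic samples;
    `train_fn(model, samples)` stands in for one supervised fine-tuning pass.
    The phase ordering mirrors the article: knowledge first, then
    foundational skills, then compositional skills.
    """
    phases = ["knowledge", "foundational_skills", "compositional_skills"]
    replay = []  # keep a slice of earlier phases in the mix to limit forgetting
    for phase in phases:
        samples = dataset_by_branch[phase]
        model = train_fn(model, samples + replay)
        replay.extend(samples[: len(samples) // 10])  # ~10% retained; arbitrary choice
    return model
```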
Results
The practical application of the LAB method has yielded impressive results. MIT researchers applied LAB to generate a synthetic dataset of 1.2 million instructions and used it to train two open-source LLMs, Labradorite 13B and Merlinite 7B. The aligned models demonstrated competitive performance across several benchmarks, outperforming chatbots trained on human-generated data as well as those aligned on larger volumes of synthetic data.
MIT’s LAB stands out not only for its efficiency and effectiveness in training LLMs but also for its potential to democratize AI development. By facilitating the generation of high-quality, task-specific instruction data, LAB enables smaller models to achieve advanced capabilities, levelling the playing field with models trained on far greater resources.