Empowering and Assessing the Utility of Large Language Models in Crop Science

¹Shanghai Artificial Intelligence Laboratory, ²Yazhouwan National Laboratory, ³China Agricultural University, ⁴Hangzhou Dianzi University,
^*Indicates Equal Contribution ^†Indicates Corresponding Authors
NIPS 2024

Paper Code Data Huggingface

fail — **Figure 1. Schematic overview of an intended use case.** By fine-tuning a base LLM using the proposed CROP dataset, we obtain a new version whose answer becomes more accurate and comprehensive, which is validated by the proposed CROP benchmark objectively.

👀 Abstract

Large language models (LLMs) have demonstrated remarkable efficacy across knowledge-intensive tasks. Nevertheless, their untapped potential in crop science presents an opportunity for advancement. To narrow this gap, we introduce CROP, which includes a novel instruction tuning dataset specifically designed to enhance LLMs’ professional capabilities in the crop science sector, along with a benchmark that serves as a comprehensive evaluation of LLMs’ understanding of the domain knowledge. The CROP dataset is curated through a task-oriented and LLM-human integrated pipeline, comprising 210,038 single-turn and 1,871 multi-turn dialogues related to crop science scenarios. The CROP benchmark includes 5,045 multiple-choice questions covering three difficulty levels. Our experiments based on the CROP benchmark demonstrate notable enhancements in crop science-related tasks when LLMs are finetuned with the CROP dataset. To the best of our knowledge, CROP dataset is the first-ever instruction tuning dataset in the crop science domain. We anticipate that CROP will accelerate the adoption of LLMs in the domain of crop science, ultimately contributing to global food production.

📚 Crop Dataset

Figure 2. Hierarchical view of tasks in CROP dataset. Dialogues can be single-turn or multi-turn (first tier). The second tier specifies task types. The third tier further decomposes these types into finer-grained tasks. Task-specified topics are rendered around the taxonomy.

Table 1. Composition of single-turn dialogue dataset. Please note that despite our data-cleaning efforts, the final CROP dataset inevitably contain a small amount of data (<0.5%) from other grains like wheat. As this portion does not dominantly influence the fine-tuning results, it is included into the final CROP dataset. We have listed it explicitly in the table to avoid any misleading counts.

Table 2. Composition of multi-turn dialogue dataset.

Blue

denotes 3-turn dialogue,

green

denotes 4-turn dialogue, and

yellow

denotes 5-turn dialogue.

📈 Crop Benchmark

Figure 3. Content distribution of benchmark. We list the keywords in the produced benchmark for a deeper insight. Darker colors indicate a higher frequency of occurrence, while lighter colors indicate a lower frequency of occurrence.

Table 3. Statistics of the benchmark.

Table 4. Benchmark Comparison. CROP benchmark surpasses existing datasets in terms of quantity and locality.
¹ https://www.certifiedcropadviser.org/become-certified/certifications/.
² https://mais500p500r.sct.embrapa.br/view/index.php. For EMBRAPA, we count the number of test-based inquires related to rice and corn.
³ https://www.agriexam.com/agriculture-previous-year-question-paper.

📦 Experiments

Table 5. Performance of selected LLMs on the CROP benchmark. Open-source LLMs are tuned on the CROP dataset in 4 epochs. We indicate the accuracy changes of the fine-tuned LLMs compared to the original in

blue

, where the accuracy has generally improved across various difficulty levels.
¹ GPT-4 API is “gpt-4-turbo-2024-04-09”. GPT-3.5 API is “gpt-3.5-turbo-0125”. Claude-3 API is “claude-3-opus-20240229”. Qwen API is “qwen-max”.
² The accuracy of 0.000 and 1.000 is explained in Paper Section 4.3.

📌 Citation

 @inproceedings{zhangempowering,
  title={Empowering and Assessing the Utility of Large Language Models in Crop Science},
  author={Zhang, Hang and Sun, Jiawei and Chen, Renqi and Liu, Wei and Yuan, Zhonghang and Zheng, Xinzhe and Wang, Zhefan and Yang, Zhiyuan and Yan, Hang and Zhong, Han-Sen and others},
  booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track}
}