Empowering and Assessing the Utility of Large Language Models in Crop Science

1Shanghai Artificial Intelligence Laboratory, 2Yazhouwan National Laboratory, 3China Agricultural University, 4Hangzhou Dianzi University
*Indicates Equal Contribution
Indicates Corresponding Authors
NeurIPS 2024

👀 Abstract

Large language models (LLMs) have demonstrated remarkable efficacy across knowledge-intensive tasks. Nevertheless, their potential in crop science remains largely untapped. To narrow this gap, we introduce CROP, which comprises a novel instruction tuning dataset designed to enhance LLMs’ professional capabilities in the crop science sector, along with a benchmark that serves as a comprehensive evaluation of LLMs’ understanding of domain knowledge. The CROP dataset is curated through a task-oriented, LLM-human integrated pipeline and contains 210,038 single-turn and 1,871 multi-turn dialogues covering crop science scenarios. The CROP benchmark includes 5,045 multiple-choice questions spanning three difficulty levels. Our experiments on the CROP benchmark demonstrate notable improvements on crop science-related tasks when LLMs are fine-tuned with the CROP dataset. To the best of our knowledge, the CROP dataset is the first instruction tuning dataset in the crop science domain. We anticipate that CROP will accelerate the adoption of LLMs in crop science, ultimately contributing to global food production.

📚 Crop Dataset


Figure 2. Hierarchical view of tasks in CROP dataset. Dialogues can be single-turn or multi-turn (first tier). The second tier specifies task types. The third tier further decomposes these types into finer-grained tasks. Task-specified topics are rendered around the taxonomy.

Table 1. Composition of the single-turn dialogue dataset. Please note that despite our data-cleaning efforts, the final CROP dataset inevitably contains a small amount of data (<0.5%) from other grains such as wheat. As this portion does not materially influence the fine-tuning results, it is retained in the final CROP dataset; we list it explicitly in the table to avoid any misleading counts.

Table 2. Composition of multi-turn dialogue dataset. Blue denotes 3-turn dialogues, green denotes 4-turn dialogues, and yellow denotes 5-turn dialogues.
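To make the dialogue formats above concrete, the sketch below shows what a single-turn and a multi-turn CROP entry might look like. The field names and the example content are illustrative assumptions for exposition, not the released schema.

```python
# Hypothetical structure of CROP dialogue entries.
# Field names and content are illustrative assumptions, not the released schema.

single_turn_example = {
    "instruction": "What soil pH range is generally recommended for rice seedbeds?",
    "response": "Rice seedbeds generally perform best in slightly acidic soil, "
                "around pH 5.5-6.5, although the optimum varies with variety "
                "and local conditions.",
}

multi_turn_example = {
    "conversations": [
        {"role": "user",
         "content": "My maize leaves are turning yellow from the tips. What could cause this?"},
        {"role": "assistant",
         "content": "Yellowing that starts at the leaf tips often points to nitrogen "
                    "deficiency, though waterlogging or disease can look similar."},
        {"role": "user",
         "content": "How can I confirm it is nitrogen deficiency?"},
        {"role": "assistant",
         "content": "A soil or tissue nitrogen test is the most reliable check; visually, "
                    "nitrogen deficiency usually appears first on older leaves."},
    ]
}
```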

📈 Crop Benchmark


Figure 3. Content distribution of the benchmark. We list the keywords appearing in the benchmark to give a deeper insight into its coverage. Darker colors indicate a higher frequency of occurrence; lighter colors indicate a lower frequency.

Table 3. Statistics of the benchmark.

Table 4. Benchmark comparison. The CROP benchmark surpasses existing datasets in terms of quantity and locality.
1 https://www.certifiedcropadviser.org/become-certified/certifications/.
2 https://mais500p500r.sct.embrapa.br/view/index.php. For EMBRAPA, we count the number of test-based inquiries related to rice and corn.
3 https://www.agriexam.com/agriculture-previous-year-question-paper.
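For readers who want to run models against the benchmark, the sketch below illustrates one way to score a model's multiple-choice predictions per difficulty level. The file name and record fields (question, options, answer, difficulty) are assumptions about the release format, not the official evaluation script.

```python
import json
from collections import defaultdict

def evaluate(predict, benchmark_path="crop_benchmark.json"):
    """Score a prediction function on the multiple-choice benchmark.

    `predict(question, options)` should return an option letter such as "A".
    Field names and the file name are assumptions about the release format.
    """
    with open(benchmark_path, encoding="utf-8") as f:
        items = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        level = item["difficulty"]          # e.g. "easy" / "medium" / "hard"
        pred = predict(item["question"], item["options"])
        total[level] += 1
        correct[level] += int(pred == item["answer"])

    # Per-level accuracy, mirroring the difficulty breakdown of the benchmark.
    return {level: correct[level] / total[level] for level in total}
```

Here `predict` can wrap any chat API or a locally fine-tuned model, as long as it maps a question and its options to an option letter.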

📦 Experiments


Table 5. Performance of selected LLMs on the CROP benchmark. Open-source LLMs are fine-tuned on the CROP dataset for 4 epochs. Accuracy changes of the fine-tuned LLMs relative to the originals are shown in blue; accuracy generally improves across difficulty levels.
1 GPT-4 API is “gpt-4-turbo-2024-04-09”. GPT-3.5 API is “gpt-3.5-turbo-0125”. Claude-3 API is “claude-3-opus-20240229”. Qwen API is “qwen-max”.
2 The accuracy of 0.000 and 1.000 is explained in Paper Section 4.3.
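The paper's exact training recipe is not reproduced here, but as a rough illustration of the fine-tuning setup (4 epochs on the CROP dataset), a minimal supervised fine-tuning sketch with Hugging Face transformers might look like the following. The model name, data file, and all hyperparameters except the epoch count are assumptions.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

# Assumed model and file names; the released dataset format may differ.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("json", data_files="crop_dataset.json")["train"]

def to_features(example):
    # Concatenate prompt and response into one causal-LM training sequence.
    text = example["instruction"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

train_ds = raw.map(to_features, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="crop-sft",
    num_train_epochs=4,              # matches the 4 epochs reported with Table 5
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```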

📰 Poster

⚖ License

The dataset and benchmark used in this project are released under CC BY-NC 4.0 (non-commercial use only).

📌 Citation