NASA, IBM Develop INDUS Large Language Models for Advanced Science Research

INDUS has been integrated into NASA's Science Discovery Engine.

NASA has partnered with IBM to develop INDUS, a suite of large language models (LLMs) designed to advance scientific research across various domains.

This collaboration, facilitated through Space Act Agreements, is led by NASA's Interagency Implementation and Advanced Concepts Team (IMPACT) and IBM.


The INDUS Suite of NASA and IBM

INDUS incorporates specialized LLMs for Earth science, biological and physical sciences, heliophysics, planetary sciences, and astrophysics. These models are trained on curated scientific data to enhance their accuracy and domain relevance.

The INDUS suite comprises two main types of models: encoders and sentence transformers. Encoders convert natural language text into numeric representations that LLMs can process, using a vocabulary tailored to scientific domains.

The INDUS vocabulary includes over 50,000 unique scientific terms, which lets the models recognize complex scientific concepts such as biomarkers and phosphorylated molecules more reliably than generic LLMs.
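To illustrate why a domain vocabulary matters, here is a minimal sketch of text encoding with a greedy longest-match tokenizer. It is not the INDUS tokenizer: the two tiny vocabularies and the function names `tokenize` and `encode` are hypothetical, chosen only to show how a term missing from a generic vocabulary gets split into fragments, while a science-aware vocabulary keeps it whole.

```python
# Hypothetical toy vocabularies (not taken from INDUS).
GENERIC_VOCAB = ["bio", "mark", "er", "s", "phos", "ph", "ory", "late", "d",
                 "in", "the", "sample"]
SCIENCE_VOCAB = GENERIC_VOCAB + ["biomarkers", "phosphorylated"]


def tokenize(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


def encode(text, vocab_list):
    """Map text to integer ids; unknown pieces share one extra id."""
    ids = {piece: i for i, piece in enumerate(vocab_list)}
    unknown_id = len(vocab_list)
    vocab = set(vocab_list)
    encoded = []
    for word in text.lower().split():
        for piece in tokenize(word, vocab):
            encoded.append(ids.get(piece, unknown_id))
    return encoded


if __name__ == "__main__":
    text = "phosphorylated biomarkers"
    print(len(encode(text, GENERIC_VOCAB)), "ids with the generic vocabulary")
    print(len(encode(text, SCIENCE_VOCAB)), "ids with the science vocabulary")
```

In this toy setup the generic vocabulary breaks the two terms into nine fragments, while the science vocabulary represents each term with a single id, so the model sees each concept as one unit.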

According to NASA, the encoders were trained on a comprehensive corpus of 60 billion tokens, covering diverse scientific disciplines. To optimize performance, the IMPACT-IBM team fine-tuned the sentence transformer models on around 268 million text pairs, including abstracts, questions, and answers.

This approach enhances INDUS's ability to perform tasks such as scientific question-answering and entity recognition in Earth sciences. Additionally, smaller and faster versions of both the encoder and sentence transformer models were developed for latency-sensitive applications, demonstrating INDUS's versatility across different computational needs.

NASA evaluation tests validate INDUS's effectiveness in retrieving relevant scientific information from extensive data repositories. This capability supports applications like the Open Science Data Repository (OSDR) API, wherein INDUS powers intuitive search functionalities and aids in dataset curation.
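The retrieval step can be pictured as ranking passages by how close their vectors are to the query vector. The sketch below is only an illustration under stated assumptions: the `embed` function is a hypothetical stand-in that counts words, whereas a trained sentence transformer would return a dense learned vector, and the three passages are invented examples rather than NASA data.

```python
import math
from collections import Counter

# Invented example passages, for illustration only.
PASSAGES = [
    "solar wind speed measured near earth",
    "soil moisture trends from satellite data",
    "bone loss in mice during spaceflight",
]


def embed(text):
    """Toy embedding: word counts stand in for a learned dense vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, passages, top_k=1):
    """Return the top_k passages most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(passages,
                    key=lambda p: cosine(query_vec, embed(p)),
                    reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    print(retrieve("satellite soil moisture", PASSAGES))
```

Swapping the word-count `embed` for a real sentence embedding model would leave the ranking logic unchanged, which is why the quality of the embeddings largely determines the quality of the search results.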

At NASA's Goddard Earth Sciences Data and Information Services Center (GES-DISC), INDUS has been instrumental in categorizing publications that reference GES-DISC datasets, streamlining research workflows and enhancing data discovery.

Science Discovery Engine of NASA

Dr. Sylvain Costes from NASA's Biological and Physical Sciences Division highlights INDUS's integration benefits, particularly in improving data curation efficiency and enhancing user experiences within scientific research platforms.

Moreover, INDUS has been integrated into NASA's Science Discovery Engine (SDE), enhancing search accuracy and relevance across NASA's vast repository of open science data.

The collaboration aims to advance artificial intelligence (AI) to support scientific discovery. INDUS models are openly accessible on platforms like Hugging Face and are poised to benefit the broader scientific community.

Future releases will include benchmark datasets for climate change, Earth science QA, and information retrieval, further empowering researchers with tools to effectively navigate and leverage vast scientific knowledge.

NASA noted that the INDUS encoder models are versatile across scientific domains, while the INDUS retriever models support information retrieval for retrieval-augmented generation (RAG) applications.

ⓒ 2024 TECHTIMES.com All rights reserved. Do not reproduce without permission.