Logo MIRAI:

Evaluating LLM Agents for Event Forecasting

1University of California, Los Angeles,
2California Institute of Technology, * Equal Contribution

A Quick Glance into MIRAI

(Overview of LLM agent forecasting with the MIRAI benchmark)

Introduction

MIRAI is a benchmark crafted for evaluating LLM agents for temporal forecasting of international events, with tool-use and complex reasoning. We finalized a collection of 991,759 GDELT event records, corresponding to 59,161 unique events and 296,630 unique news articles. Our test set contains 705 query and answer pairs on forecasting an event of given timestamp between two countries, with a 100 balanced test subset.

geometric reasoning

MIRAI comprehensively covers global event data. The circular chart shows the relation hierarchy and distribution in MIRAI.

We display our curated events over the world map in the heatmap figures below, illustrating global coverage and varying intensities of conflict and mediation across regions.

The MIRAI Benchmark

We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting.

MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously source and integrate critical information from large global databases; 2) write codes using domain-specific APIs and libraries for tool-use; and 3) jointly reason over historical knowledge from diverse formats and time to accurately predict future events.

framework

Overview of the LLM agent's interaction with the multi-source environment using the ReAct strategy for forecasting a query event. The framework consists of three main steps: (1) Think: The agent analyzes the current status and plans the next action based on the query and the provided API specifications. (2) Act: The agent generates a Single Function call or a Code Block to retrieve and analyze relevant data from the database. (3) Execute: The Python interpreter runs the generated code with the API implementation and database and produces observations. These steps are iteratively performed until the agent reaches a final forecast for the future relation.

Experiments

Evaluate Forecasting with Different Agent Methods and Tools. To investigate the effect of different tools (APIs) and agent tool-use strategies, we use gpt-3.5-turbo-0125 as the base model and evaluate on the 705 queries in the test set.

Evaluate Forecasting with Different Base LLMs. We further investigate the role of the underlying LLMs in the agent's performance.

Code execution error analysis. Our implemented agents interact with tools via code but often encounter execution error. We can see that invalid dates and invalid attributes are the two most frequent errors even for gpt-4-turbo. Notably, gpt-4o makes significantly fewer execution errors.

Analysis of Agent Behaviours

Impact of temporal distance of the forecasting target. Our defined event forecasting task varies by temporal distance l, which specifies how far into the future we want to predict. As the temporal distance increases, the F1 score decreases and KL-divergence increases. This indicates that the agent's ability to provide accurate predictions diminishes for events further in the future.

Forecasting accuracy on different relation types. We split the datasets into distinct quadratic relation classes and compute the F1 score for each class. All models exhibit significantly higher performance for ''verbal cooperation'' and ''material conflict'', while lower in the other two categories.

How tool-use ordering influences forecasting. We further investigate the impact of action order on the GPT-4o agent in ''Single Function'' mode. Typically, the agent begins with get_relation_distribution or get_event to gather an initial set of recent and frequent events for key information, and often concludes with browse_news_article and get_news_articles, which retrieve news content to make accurate forecasts.

Forecasting Example

Query Details

  • Query Quadruplet: (2023-11-03, AUS, ?, CHN)
  • Temporal Distance: 1; therefore, the current date is 2023-11-02
  • Agent Max Steps: 20

Query Prompt

Please forecast the relations that Australia will take towards China on November 03, 2023 based on historical information up to November 02, 2023. I.e., forecast the relation CAMEO codes in query event Event(date=2023-11-03, head_entity=ISOCode(AUS), relation=CAMEOCode(?), tail_entity=ISOCode(CHN)).

System Prompt

You are an expert in forecasting future events based on historical data. The database contains news articles from January 1, 2023 to the current date November 02, 2023 and the events extracted from these articles. The events are in the form of (date, subject country, relation, object country), where the countries are represented by ISO 3166-1 alpha-3 codes and the relations are represented by the CAMEO codes defined in the "Conflict and Mediation Event Observations" ontology. The relations are hierarchical: first-level relations are general parent relations represented by two-digit CAMEO codes, while second-level relations are more specific child relations represented by three-digit CAMEO codes. Child relations have the same first two digits as their parent relations. For example, "01" is a first-level relation, and "010" and "011" are some of its second-level relations. The relations in the database are represented in the second-level form.

Your task is to forecast the future relations between two entities in a given query. You have access to a defined Python API that allows you to query the database for historical events and statistics, and to get precise information about the ISO country codes and CAMEO relation codes. You are also authorized to utilize additional safe, well-established Python libraries such as numpy, pandas, scikit-learn, and NetworkX to enhance your data analysis and forecasting accuracy.

The defined API is described as follows:

[FULL API SPECIFICATION HERE]

You will use an iterative approach, interleaving "Thought", "Action", and "Observation" steps to collect information and perform the forecast. You may perform up to 20 iterations. The steps are as follows:

  • Thought: Analyze the current information and reason about the current situation, and predicts which API you want to use (try to use different APIs to collect diverse information) or make a decision that you want to make a final answer.
  • Action: Use the API to gather more information or provide the final forecast.
    • If gathering more data: the action must be an executable Python code snippet that starts with ```python and ends with ```. It can contain multiple lines of codes and function calls using the defined API or Python libraries. You must use print() to output the results, and only the printed output will be returned in the observation step.
    • If making the final forecast: the action must start immediately with "Final Answer:", and follow with the answer in the expected JSON format. This should not be enclosed within triple backticks.
  • Observation: Return the printed output of the executed code snippet.

To make a reasonable forecast, you should collect both news and relational evidence to support your prediction. When you are fully confident that you accumulate enough information to make the final forecast, you should start the "Thought" with your reasoning using the news and structural information to make the prediction, and then start the "Action" step with Final Answer: followed by the answer in the expected JSON format. The answer should be a JSON dictionary where the keys are the forecasted two-digit first-level CAMEO codes and the values are lists of forecasted three-digit second-level CAMEO codes that are child relations of the key. For example, Action: Final Answer: {"01": ["010", "011", "012"], "02": ["020", "023"]}.

The final answer will be evaluated based on the precision and recall of the forecasted first-level and second-level relations, so only include confident first-level and second-level CAMEO codes in your final forecast.

Try to use different APIs and Python libraries to collect diverse information (including multi-hop relations), such as the precise meaning of CAMEO codes, insights from news content, relational data, and statistical analyses to support your forecasts. Consider not only the frequency of the relations but also the temporal aspects of the data when making your forecast.

Agent ReAct Process 1

Thought: To forecast the relations that Australia will take towards China on November 03, 2023, we need to gather historical data on the interactions between these two countries. This includes examining the frequency and types of relations observed in the past, as well as any recent trends or significant events that might influence future relations.

We will proceed with the following steps:

  • (a) Retrieve Historical Events: Gather events involving Australia (AUS) and China (CHN) to understand the context and types of relations that have occurred in the past.
  • (b) Analyze Relation Distribution: Analyze the distribution of different CAMEO codes in these events to identify the most common types of interactions.
  • (c) Examine Recent Trends: Focus on recent events to detect any emerging trends or shifts in relations.
  • (d) Contextual Analysis: Review news articles to gain insights into the context and underlying reasons for these relations.

Let's start by retrieving historical events between Australia and China.

Action:

from datetime import datetime

# Define the date range for the historical data collection
start_date = Date("2023-01-01")
end_date = Date("2023-11-02")
date_range = DateRange(start_date=start_date, end_date=end_date)

# Define the ISO codes for Australia and China
aus_code = ISOCode("AUS")
chn_code = ISOCode("CHN")

# Retrieve historical events involving Australia and China
events = get_events(date_range=date_range, head_entities=[aus_code], tail_entities=[chn_code])

# Print the retrieved events
for event in events:
    print(event)

Observation:

Event(date=Date("2023-11-02"), head_entity=ISOCode("AUS"), relation=CAMEOCode("042"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-11-02"), head_entity=ISOCode("AUS"), relation=CAMEOCode("040"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-11-02"), head_entity=ISOCode("AUS"), relation=CAMEOCode("036"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-11-01"), head_entity=ISOCode("AUS"), relation=CAMEOCode("042"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-11-01"), head_entity=ISOCode("AUS"), relation=CAMEOCode("173"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-11-01"), head_entity=ISOCode("AUS"), relation=CAMEOCode("036"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-31"), head_entity=ISOCode("AUS"), relation=CAMEOCode("036"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-31"), head_entity=ISOCode("AUS"), relation=CAMEOCode("042"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-31"), head_entity=ISOCode("AUS"), relation=CAMEOCode("046"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-30"), head_entity=ISOCode("AUS"), relation=CAMEOCode("036"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-30"), head_entity=ISOCode("AUS"), relation=CAMEOCode("042"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-30"), head_entity=ISOCode("AUS"), relation=CAMEOCode("046"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-29"), head_entity=ISOCode("AUS"), relation=CAMEOCode("036"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-27"), head_entity=ISOCode("AUS"), relation=CAMEOCode("036"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-27"), head_entity=ISOCode("AUS"), relation=CAMEOCode("042"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-26"), head_entity=ISOCode("AUS"), relation=CAMEOCode("173"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-26"), head_entity=ISOCode("AUS"), relation=CAMEOCode("061"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-26"), head_entity=ISOCode("AUS"), relation=CAMEOCode("042"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-25"), head_entity=ISOCode("AUS"), relation=CAMEOCode("042"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-23"), head_entity=ISOCode("AUS"), relation=CAMEOCode("042"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-23"), head_entity=ISOCode("AUS"), relation=CAMEOCode("036"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-23"), head_entity=ISOCode("AUS"), relation=CAMEOCode("031"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-23"), head_entity=ISOCode("AUS"), relation=CAMEOCode("061"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-22"), head_entity=ISOCode("AUS"), relation=CAMEOCode("040"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-22"), head_entity=ISOCode("AUS"), relation=CAMEOCode("061"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-22"), head_entity=ISOCode("AUS"), relation=CAMEOCode("036"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-22"), head_entity=ISOCode("AUS"), relation=CAMEOCode("043"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-22"), head_entity=ISOCode("AUS"), relation=CAMEOCode("042"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-21"), head_entity=ISOCode("AUS"), relation=CAMEOCode("042"), tail_entity=ISOCode("CHN"))
Event(date=Date("2023-10-21"), head_entity=ISOCode("AUS"), relation=CAMEOCode("040"), tail_entity=ISOCode("CHN"))

Agent ReAct Process 2

Thought: The retrieved events indicate a variety of relations between Australia and China. To forecast the future relations, we need to analyze the distribution of these relations to identify the most common types and any recent trends.

Next, we will analyze the distribution of the different CAMEO codes in these events to identify the most frequent interactions. This will help us understand the dominant types of relations and any emerging patterns.

Action:

from collections import Counter

# Extract the CAMEO codes from the retrieved events
cameo_codes = [event.relation.code for event in events]

# Count the frequency of each CAMEO code
cameo_code_counts = Counter(cameo_codes)

# Print the distribution of CAMEO codes
for code, count in cameo_code_counts.items():
    print(f"CAMEO Code: {code}, Count: {count}")
          

Observation:

CAMEO Code: 042, Count: 10
CAMEO Code: 040, Count: 3
CAMEO Code: 036, Count: 8
CAMEO Code: 173, Count: 2
CAMEO Code: 046, Count: 2
CAMEO Code: 061, Count: 3
CAMEO Code: 031, Count: 1
CAMEO Code: 043, Count: 1

Agent ReAct Process 3

Thought: The distribution of CAMEO codes from the historical events between Australia and China reveals that certain types of interactions are more frequent than others. Specifically, the most common relations are:

  • 042: 10 occurrences
  • 036: 8 occurrences
  • 040: 3 occurrences
  • 061: 3 occurrences
  • 046: 2 occurrences
  • 173: 2 occurrences
  • 031: 1 occurrence
  • 043: 1 occurrence

To make a more informed forecast, we should understand what these CAMEO codes represent. We will map these codes to their descriptions to gain insights into the nature of these interactions.

Action:

# Define the CAMEO codes to map
cameo_codes_to_map = ["042", "036", "040", "061", "046", "173", "031", "043"]

# Map the CAMEO codes to their descriptions
relations = [map_cameo_to_relation(CAMEOCode(code)) for code in cameo_codes_to_map]

# Print the descriptions of the CAMEO codes
for relation in relations:
    print(f"CAMEO Code: {relation.cameo_code}, Name: {relation.name}, Description: {relation.description}")

Observation:

CAMEO Code: 042, Name: Make a visit, Description: Travel to another location for a meeting or other event.
CAMEO Code: 036, Name: Express intent to meet or negotiate, Description: Offer, promise, agree to, or otherwise indicate willingness or commitment to meet, visit, or engage in talks or negotiations.
CAMEO Code: 040, Name: Consult, not specified, Description: All consultations and meetings, not otherwise specified.
CAMEO Code: 061, Name: Cooperate economically, Description: Initiate, resume, improve, or expand economic exchange or cooperation.
CAMEO Code: 046, Name: Engage in negotiation, Description: Negotiate or bargain with others.
CAMEO Code: 173, Name: Arrest or detain, Description: Legal or extrajudicial arrests, detentions, or imprisonments.
CAMEO Code: 031, Name: Express intent to engage in material cooperation, Description: Offer, promise, agree to, or otherwise indicate willingness or commitment to engage in or expand material cooperative exchange, such as express intent to cooperate economically, militarily, on judicial matters, and on intelligence.
CAMEO Code: 043, Name: Host a visit, Description: Host or receive a visitor at residence, office or home country.

Agent ReAct Process 4

Thought: The distribution of CAMEO codes from the historical events between Australia and China reveals that certain types of interactions are more frequent than others. Specifically, the most common relations are:

  • 042 (Make a visit): 10 occurrences – Indicates frequent travel for meetings or events.
  • 036 (Express intent to meet or negotiate): 8 occurrences – Shows a strong intent to engage in talks or negotiations.
  • 040 (Consult, not specified): 3 occurrences – General consultations and meetings.
  • 061 (Cooperate economically): 3 occurrences – Economic cooperation initiatives.
  • 046 (Engage in negotiation): 2 occurrences – Actual negotiations or bargaining.
  • 173 (Arrest or detain): 2 occurrences – Legal or extrajudicial arrests or detentions.
  • 031 (Express intent to engage in material cooperation): 1 occurrence – Intent to cooperate materially.
  • 043 (Host a visit): 1 occurrence – Hosting visitors.

Given this data, we can see that the most frequent interactions involve visits, expressions of intent to meet or negotiate, and economic cooperation. These interactions suggest a pattern of diplomatic engagement and economic collaboration, with occasional legal actions.

To further solidify our forecast, we should examine recent news articles to understand the context and any underlying reasons for these interactions. This will help us identify any emerging trends or significant events that might influence future relations.

Next, we will retrieve recent news articles involving Australia and China to gain contextual insights.

Action:

# Define the date range for recent news articles
recent_start_date = Date("2023-10-01")
recent_end_date = Date("2023-11-02")
recent_date_range = DateRange(start_date=recent_start_date, end_date=recent_end_date)

# Retrieve recent news articles involving Australia and China
news_articles = get_news_articles(date_range=recent_date_range, head_entities=[aus_code], tail_entities=[chn_code])

# Print the retrieved news articles
for article in news_articles:
    print(article)

Observation:

(Date(date="2023-11-02"), "No quick fix to remaining tariffs ahead of China visit")
(Date(date="2023-11-02"), "China's aggression in Indo-Pacific leads to 'concern and angst': Peter Dutton, Australia's leader of opposition")
(Date(date="2023-11-02"), "Australia and China eye new ways to heal old wounds")
(Date(date="2023-11-02"), "Albanese must aim beyond bilateral outcomes on Beijing visit")
(Date(date="2023-11-02"), "Western Australian Premier Announces Visit to China")
(Date(date="2023-11-02"), "Australian PM's China visit will help normalize relations, says ex-trade minister")
(Date(date="2023-11-02"), "Nicholas Moore on how to get in early on the next big growth story")
(Date(date="2023-11-02"), "PM's China trip cements new era for ties")
(Date(date="2023-11-02"), "PM to take a hard line on trade with China")
(Date(date="2023-11-01"), "Australian Prime Minister Anthony Albanese to raise imprisoned democracy blogger during China visit")

Agent ReAct Process 5

Thought: The recent news articles provide valuable context for the relations between Australia and China. Key themes include:

  • Diplomatic Visits and Engagements: Multiple articles highlight the upcoming visit of Australian Prime Minister Anthony Albanese to China. This aligns with the frequent occurrence of CAMEO code 042 (Make a visit).
  • Economic Cooperation and Trade: Discussions around trade and economic cooperation are prominent, indicating ongoing efforts to improve economic ties. This supports the relevance of CAMEO codes 061 (Cooperate economically) and 036 (Express intent to meet or negotiate).
  • Concerns and Tensions: Articles mention concerns about China"s aggression in the Indo-Pacific and human rights issues, suggesting underlying tensions. This is reflected in the occasional occurrence of CAMEO code 173 (Arrest or detain).

Given the historical data and recent news context, the most likely relations that Australia will take towards China on November 03, 2023, include diplomatic visits, economic cooperation, and negotiations, with a possibility of addressing underlying tensions.

Final Answer:

{
  "03": ["031", "036"],
  "04": ["040", "042", "043", "046"],
  "06": ["061"],
  "17": ["173"]
}

Action: Final Answer:

{
  "03": ["031", "036"],
  "04": ["040", "042", "043", "046"],
  "06": ["061"],
  "17": ["173"]
}

Agent Final Status

End State: Final Answer

Number of Steps Taken: 5

Final Answer:

{
  "03": ["031", "036"],
  "04": ["040", "042", "043", "046"],
  "06": ["061"],
  "17": ["173"]
}

We show the corresponding relation names of the predicted codes here for a better review:

{
  "Express intent to cooperate": ["Express intent to engage in material cooperation", "Express intent to meet or negotiate"],
  "Consult": ["Consult, not specified", "Make a visit", "Host a visit", "Engage in negotiation"],
  "Engage in material cooperation": ["Cooperate economically"],
  "Coerce": ["Arrest or detain"]
}

Ground Truth Answer:

{
  "03": ["036"],
  "04": ["042"],
  "17": ["173"]
}

We show the corresponding relation names of the ground-truth codes here for a better review:

{
  "Express intent to cooperate": ["Express intent to meet or negotiate"],
  "Consult": ["Make a visit"],
  "Coerce": ["Impose administrative sanctions"]
}

BibTeX

@misc{ye2024miraievaluatingllmagents,
      title={MIRAI: Evaluating LLM Agents for Event Forecasting}, 
      author={Chenchen Ye and Ziniu Hu and Yihe Deng and Zijie Huang and Mingyu Derek Ma and Yanqiao Zhu and Wei Wang},
      year={2024},
      eprint={2407.01231},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.01231}, 
}