Finding Climate Targets with LLMs, Part 2 - LLM-Only Approach
Measuring LLM performance on identifying corporate climate targets without RAG.
· 15 min read
Introduction #
In my previous post, I gave a brief overview of the concepts behind finding corporate climate emission targets. In this second part, I want to evaluate OpenAI’s GPT-5.x family at determining climate targets for given companies and years, but without any assistance from RAG.
This lets us get an idea of how well the model “already knows” the answers without RAG. Specifically, I need to ask the LLMs to tell me the climate emissions targets for some pre-defined companies and reporting years, and measure their performance against that known data.
I broke this down into two parts:
- Establishing an evaluation set data set;
- Prompting the LLMs to answer questions against that data set.
For those without the patience to go through everything: the results show that all three models performed similarly conservatively for the given prompt, successfully avoiding false assertions. This led to relatively low recall of around 0.3, paired with precision of at least 0.87. GPT-5.1 performed best in this assessment, followed closely by GPT-5.2, while GPT-5 performed notably worse.
Methodology #
Harvesting data #
I first had to establish which companies and reporting years to include in the analysis. I am trying to determine whether LLMs can identify the targets that existed for a company given only its identity (e.g. company name) and a year.
I would need companies with a high likelihood of reporting climate targets in publicly available documents - there would be no point trying to find climate targets for companies that do not have or report them. This meant looking at listed companies, and the S&P500 seemed like a good place to start.
Whichever companies I chose, I would have to go through them, collect documentation and then create a reference or evaluation set of targets from them. By that, I mean I would need to go through their collected documents and manually determine what targets are in there to establish the ground truth. Constructing this ground truth obviously requires a huge amount of manual labour and concentration to ensure absolute recall and precision, and I needed to be realistic about the scale I could achieve as one person.
Therefore, I settled on analysing the “Magnificent Seven” (Mag7) group of companies. As of November 2025, the Magnificent Seven represented over a third of the S&P 500 and comprised the following companies:
- Alphabet Inc. (GOOGL)
- Amazon.com, Inc. (AMZN)
- Apple Inc. (AAPL)
- Microsoft Corporation (MSFT)
- Meta Platforms, Inc. (META)
- NVIDIA Corporation (NVDA)
- Tesla, Inc. (TSLA)
I concluded that starting with the reporting data for these seven companies from the last two years of reporting would suffice; if not, I could always add more years and companies. Since a corporate report looks back at the preceding year, I selected 2023 and 2024 as the reporting years.
I also decided to limit the scope of the analysis to carbon emission targets only, excluding other climate-related targets such as renewable electricity or supplier engagement targets. These would require separate schemas and add unnecessary complexity.
In the previous post, I described how corporate targets are typically reported via CDP, SBTi and general corporate disclosures.
CDP
In my previous post, I talked about starting with CDP data to obtain a company’s climate-related data. As mentioned, many see it as the “gold standard” of corporate climate-related reporting. Only a subset of companies report to CDP, because of the high cost of producing reports to the required standard. The reports, or “disclosures”, follow a well-thought-out structured format and can be parsed using regular expressions, without the need for LLMs. However, accessing and using CDP disclosures requires a data license, meaning I can only use disclosures that companies have self-published themselves.
SBTi
SBTi make their information freely available as snapshot export spreadsheets on their website. I obtained an export of the data from their dashboard on 14 October 2025.
| Ticker | Company name | Targets status | Commitments status |
|---|---|---|---|
| NVDA | NVIDIA Corporation | Scope 1, 2, 3 targets (published: 2025-07-19) | — |
| MSFT | Microsoft Corporation | Scope 2, 3 targets (published: 2019-09-23) | Commitment removed (published: 2020-01-01) |
| AAPL | Apple Inc. | Scope 1, 2, 3 targets (published: 2021-05-06) | Commitment set to BA1.5 (published: 2021-04-22) |
| GOOGL | Alphabet Inc. | Scope 1, 2, 3 targets (published: 2022-09-29) | — |
| AMZN | Amazon.com, Inc. | No SBTi targets | Commitment removed (published: 2020-06-18) |
| META | Meta Platforms, Inc. | Scope 1, 2, 3 targets (published: 2024-02-29) | Commitment set to BA1.5 (published: 2020-09-17) |
| TSLA | Tesla, Inc. | No SBTi targets | Commitment removed (published: 2021-10-26) |
Table 1: SBTi data for Mag7 companies
Table 1 shows the SBTi data for the Mag7 companies on the collection date. Note that targets can exist outside the SBTi’s purview: both AMZN and TSLA could still have set climate targets elsewhere. Therefore, we still need to look at investor reporting; SBTi simply informs us of what we might expect to find there.
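For anyone wanting to reproduce this step, here is a minimal sketch of how the SBTi snapshot export could be filtered down to the Mag7 rows, assuming it loads with pandas; the file name and the “Company Name” column header are placeholders rather than the export’s actual headers.

```python
import pandas as pd

# Hypothetical file name and column header -- substitute the actual ones
# from the SBTi snapshot export (downloaded 14 October 2025).
MAG7_NAMES = [
    "Alphabet", "Amazon", "Apple", "Microsoft",
    "Meta Platforms", "NVIDIA", "Tesla",
]

df = pd.read_excel("sbti_export_2025-10-14.xlsx")

# Keep rows whose company name matches one of the Mag7 names.
pattern = "|".join(MAG7_NAMES)
mag7 = df[df["Company Name"].str.contains(pattern, case=False, na=False)]
print(mag7)
```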
Investor Reporting
Publicly-listed companies are expected to report annually to their investors on what the company has achieved and its plans for the future. These documents can run to hundreds of pages, with complex formatting, and can confound even human readers. Many companies also report on their ESG performance, sometimes in separate sustainability reports and sometimes within a consolidated annual report. These reports usually reside on a dedicated part of the company’s corporate website, sometimes under investor-specific areas of the site, typically as PDFs.
For each year and company, I had to collect documents that had a reasonable likelihood of containing climate targets. In the past, I have written web crawlers to assist this process. This time I experimented with a new approach: I used a structured search prompt to help automate the retrieval of primary-source disclosures for each company-year. The objective was to obtain a high signal-to-noise ratio and prioritise documents likely to contain explicit emissions targets, while excluding third-party documents.
This meant preferring sustainability/impact/ESG/climate/TCFD reports, ESG data packs/spreadsheets, and annual reports or 10-Ks. I kept the search open to CDP Climate Change Questionnaires.
I began with the following prompt:
Retrieve links to relevant documents (e.g. consolidated or impact annual reports in PDF format, along with related spreadsheet) that are likely to hold details of the climate-related carbon emission targets for the company and reporting year below:
- Company: TSLA Tesla, Inc.
- Fiscal year: 2024 (published in 2025)
I prize high signal-to-noise ratio so provide links to docs with a high likelihood of having carbon emissions targets info only. CDP Questionnaires are a great source of info. "Overall" annual corporate reports can sometimes include relevant information too so include those too.
With the help of ChatGPT, I gradually iterated to using this prompt:
Task
Find only primary-source documents that are highly likely to contain climate-related carbon emissions targets for the specified company and reporting year.
Inputs
- Company (legal name + ticker): {{COMPANY = **Apple Inc.**}} ({{TICKER = **AAPL**}})
- Reporting year (and when published): {{YEAR = **2024**}} (published {{PUB_YEAR = **2025**}})
- Primary domain (if known): {{PRIMARY_DOMAIN}}
**What to return (ranked, most relevant first)**
1. CDP Climate Change Questionnaire (full PDF)
2. Sustainability/Impact/ESG/Climate/TCFD report (PDF)
3. ESG/metrics databook or spreadsheet (XLS/XLSX/CSV) with targets
4. Annual report — include if it plausibly mentions targets. Be optimistic.
5. 10-K — only if it contains explicit emissions targets
**Prioritize domains**
- Company’s official sites (main corporate site, investor relations)
- Regulator filings only if targets are explicitly present (e.g., sec.gov)
**Exclude**
- News/blogs/third-party summaries, vendor decks, paywalled teasers, non-English files, broken PDFs, image-only scans without text.
**Verification**
- Confirm the document covers {{YEAR}} (or sets targets applicable to {{YEAR}} onward).
- Prefer searchable PDFs/spreadsheets.
- If multiple versions exist, keep the latest published in {{PUB_YEAR}} and dedupe.
**Output format (concise)**
Return two parts:
*A) Document hits — one row per document (markdown table), then a JSON array.*
Table columns:
- Source type [CDP | Sustainability/Impact | ESG Data Pack | Annual | 10-K | Other (specify)]
- Title
- Publisher (domain)
- Publication date (ISO)
- Coverage year(s)
- Why relevant (≤12 words, must mention “targets”)
- Direct file URL (PDF/XLS/XLSX/CSV)
- Landing page URL
- Discovery page URL (page where you found the file link)
- Confidence [0–1]
JSON schema:
`[ { "source_type": "", "title": "", "publisher_domain": "", "publication_date": "YYYY-MM-DD", "coverage_years": "YYYY or YYYY–YYYY", "why_relevant": "", "direct_file_url": "", "landing_page_url": "", "discovery_page_url": "", "confidence": 0.0 } ]`
*B) Company reference links — curated list/table of high-value URLs, then a JSON array.*
Table columns:
- Link type [IR home | Sustainability hub | Reporting archive | ESG data portal | CDP company page | SEC filings | Climate/TCFD page | Policies | Other (specify)]
- Title
- URL
- Why useful (≤8 words)
JSON schema:
`[ { "link_type": "", "title": "", "url": "", "why_useful": "" } ]`
**Search strategy (follow, don’t print)**
- Query patterns:
- `"{{COMPANY}}" (sustainability OR impact OR ESG OR climate OR TCFD) filetype:pdf {{PUB_YEAR}}`
- `"{{COMPANY}}" (ESG data book OR metrics OR KPI OR dataset) (xlsx OR xls OR csv)`
- `site:cdp.net "{{COMPANY}}" "Climate Change Questionnaire" filetype:pdf`
- `site:{{PRIMARY_DOMAIN}} (sustainability OR impact OR ESG OR climate) (filetype:pdf OR xls OR xlsx)`
- `site:ir.{{PRIMARY_DOMAIN}} (sustainability OR impact OR ESG OR climate)`
- `site:sec.gov "{{COMPANY}}" 10-K {{PUB_YEAR}}` (only to confirm explicit targets)
- For each retained document: keep both the direct file URL and the exact discovery page URL where the link was surfaced (index, archive, blog post, release page, etc.).
- Open top 10 results per pattern; keep max 8 final docs total (quality > quantity).
**Hard cap**
Return at most 8 document hits; omit anything with low likelihood of targets.
In addition to this search-driven discovery, I manually inspected each corporate site for any remaining documents. I found the search-driven discovery to work well. I would like to refine and automate it further into an agent, with some evaluation. That remains out of scope here.
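As a pointer towards that automation, here is a minimal sketch of how the prompt template could be filled in per company-year; the `SEARCH_PROMPT_TEMPLATE` constant, the company lookup and the placeholder substitution are illustrative assumptions, not the exact code I used.

```python
# Illustrative only: fill the search prompt template for each company-year.
SEARCH_PROMPT_TEMPLATE = """Task
Find only primary-source documents that are highly likely to contain
climate-related carbon emissions targets for the specified company and
reporting year.

Inputs
- Company (legal name + ticker): {company} ({ticker})
- Reporting year (and when published): {year} (published {pub_year})
- Primary domain (if known): {primary_domain}
...
"""

MAG7 = {
    "AAPL": ("Apple Inc.", "apple.com"),
    "MSFT": ("Microsoft Corporation", "microsoft.com"),
    # ... remaining Mag7 companies
}

def build_prompts(years=(2023, 2024)):
    """Yield one filled-in search prompt per company-year pair."""
    for ticker, (name, domain) in MAG7.items():
        for year in years:
            yield SEARCH_PROMPT_TEMPLATE.format(
                company=name,
                ticker=ticker,
                year=year,
                pub_year=year + 1,
                primary_domain=domain,
            )
```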
In total, I collected around 650MB of relevant documents for the Mag7 companies across the two reporting years.
The only publicly available CDP questionnaires found were for the financial year 2023 (published in 2024) for Alphabet, Apple and Microsoft.
Defining structured targets #
Targets by nature are often described in unstructured text. This makes comparisons of targets challenging without turning them into a structured format. For example, at the time of writing, Apple claim on their website, on a page called “Apple 2030”:
- “We’ve committed to reducing our emissions by 75% compared with our 2015 baseline.”
We can turn this into the following more structured representation:
- baseline year: 2015
- target year: 2030
- absolute emissions reduction: 75%
In practice, we need more parameters than this to sufficiently describe a target. Thankfully, SBTi has a well-conceived parameterised schema for emissions targets data, whose parameters include:
- Horizon: `near_term` | `net_zero`
- Metric type: `absolute` | `intensity`
- Scopes covered: `S1`, `S2`, `S3`
- Base year
- Target year
- Reduction percent
- Temperature ambition (`1.5C`, `well_below_2C`, `2C`, `unspecified`)
- Status (`approved`, `committed`, `in_validation`, `expired`, `unknown`)
- (and more)
By implementing the full SBTi target schema as a Pydantic model (given below), I could require the OpenAI LLMs to produce structured outputs for the targets they identified:
from typing import List, Literal, Optional

from pydantic import BaseModel, Field, confloat, conint

TargetHorizon = Literal["near_term", "long_term", "net_zero"]
MetricType = Literal["absolute", "intensity"]
Ambition = Literal["1.5C", "well_below_2C", "2C", "unspecified"]
Status = Literal["approved", "committed", "in_validation", "expired", "unknown"]
TargetType = Literal[
    "sbti_near_term",
    "sbti_net_zero",
    "non_target_claim",
]

class Target(BaseModel):
    title: Optional[str] = Field(None, description="Human-friendly label if present in text")
    target_type: TargetType = Field(..., description="Classify each item.")
    horizon: TargetHorizon = Field(..., description="near_term / long_term / net_zero (SBTi framing)")
    metric_type: MetricType = Field(..., description="Absolute or intensity target")
    scopes_covered: List[Literal["S1", "S2", "S3"]] = Field(..., description="Which scopes this target covers")
    scope3_categories: Optional[List[conint(ge=1, le=15)]] = Field(None, description="If S3: list of GHGP category numbers 1–15, if specified")
    ambition: Ambition = Field(..., description="Declared temperature alignment (if stated)")
    coverage_pct: Optional[confloat(ge=0, le=100)] = Field(None, description="Share of emissions covered by the target (company-wide or in-scope)")
    base_year: Optional[int] = Field(None, description="Baseline year (e.g., 2019)")
    target_year: Optional[int] = Field(None, description="Completion year (e.g., 2030)")
    reduction_pct: Optional[confloat(ge=0, le=100)] = Field(None, description="Required % reduction vs base (for absolute) or intensity")
    base_value: Optional[float] = Field(None, description="If explicitly given (e.g., baseline tCO2e or intensity)")
    target_value: Optional[float] = Field(None, description="If explicitly given (e.g., target tCO2e or intensity)")
    unit: Optional[str] = Field(None, description="Unit for base/target values (e.g., tCO2e, tCO2e/$, %, ktCO2e)")
    status: Status = Field(..., description="SBTi status if mentioned")
    boundary: Optional[str] = Field(None, description="Organizational/operational boundary (e.g., company-wide, specific BU/geography)")
    notes: Optional[str] = Field(None, description="Any qualifiers (market-based/location-based, exclusions, etc.)")
    sources: List[str] = Field(..., description="Doc names/URLs/node IDs where this target was found")

class ExtractedTargets(BaseModel):
    company: Optional[str] = None
    targets: List[Target] = Field(
        default_factory=list,
        description="All SBTi-style targets found in the context",
    )
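To make the schema concrete, here is a sketch of how the Apple 2030 claim quoted earlier might be captured as a `Target` instance. The field values are my own reading of that single public sentence, not an official SBTi record; fields the quote does not specify are marked as interpretation or left unspecified.

```python
# Sketch only: my own reading of the quoted "Apple 2030" claim.
apple_2030_example = Target(
    title="Apple 2030",
    target_type="sbti_near_term",       # classification is my interpretation
    horizon="near_term",
    metric_type="absolute",
    scopes_covered=["S1", "S2", "S3"],  # assumed: the claim covers the whole footprint
    ambition="unspecified",             # the quoted sentence states no temperature alignment
    base_year=2015,
    target_year=2030,
    reduction_pct=75,
    status="unknown",
    sources=["Apple 2030 web page"],
)
```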
I did not adhere entirely faithfully to the SBTi framework. For example, I added optional qualifiers (e.g. boundary, market-/location-based notes) and provenance metadata (e.g. document identifiers, node IDs). I also left out items such as “year_type”, which I did not feel contributed to the task here. Other climate-related targets (e.g. renewable energy procurement goals, efficiency initiatives, supplier engagement) are labelled as non-target claims and excluded from scoring.
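For reference, here is a minimal sketch of how the schema can be used to request structured output, assuming the official `openai` Python SDK’s Pydantic parsing helper and the `ExtractedTargets` model defined above; the prompt wording and the model identifier string are illustrative, not the exact ones used in the notebook.

```python
from openai import OpenAI

client = OpenAI()

def extract_targets(company: str, year: int, model: str = "gpt-5.1") -> ExtractedTargets:
    """Ask the model for structured targets for one company-year (no RAG)."""
    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": "You extract corporate carbon emission targets. "
                                          "Only state targets you are confident existed."},
            {"role": "user", "content": f"List the carbon emission targets reported by "
                                        f"{company} for reporting year {year}."},
        ],
        response_format=ExtractedTargets,  # the Pydantic model enforces the structure
    )
    return completion.choices[0].message.parsed
```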
Creating an evaluation corpus #
After defining the target schema, I had the necessary basis for creating an evaluation set/corpus. As outlined earlier, creating an evaluation data set involves manually inspecting every PDF collected (for all Mag7 companies, for both 2023 and 2024) in order to find every climate target within them - and then producing a manually annotated JSON reference set for those targets. Obviously this is a considerable amount of work.
Since each target can consist of a great many individual data points, I chose to cover only a prioritised subset for the sake of evaluation, namely these experimental fields:
- horizon
- metric_type
- scopes_covered
- base_year
- target_year
- reduction_pct
- ambition
- target_value
- unit
This would cut down the annotation work while still serving as a representative model of extracted target data. I chose not to simplify the Pydantic model because I did not want to risk confusing the LLM by asking it to compress target information into an insufficient model. I felt it safer to let it output into a rich, expressive schema and then assess performance on the prioritised subset of experimental fields afterwards.
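To give a flavour of the annotation, here is a hypothetical reference entry restricted to the experimental fields; the company, values and layout are illustrative, not copied from the evaluation set in the repo.

```python
# Hypothetical reference annotation for one company-year (values illustrative).
reference_targets = {
    "company": "ExampleCo",
    "reporting_year": 2024,
    "targets": [
        {
            "horizon": "near_term",
            "metric_type": "absolute",
            "scopes_covered": ["S1", "S2"],
            "base_year": 2019,
            "target_year": 2030,
            "reduction_pct": 50,
            "ambition": "1.5C",
            "target_value": None,  # only annotated when explicitly reported
            "unit": None,
        }
    ],
}
```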
Having everything defined, I generated a JSON format from the Pydantic schema and proceeded to manually annotate the evaluation set, paying close attention to ensuring the values for the experimental fields were correct. In total, this produced 202 data points against which the LLMs could be assessed across the whole corpus, as shown in Table 2.
| Field | Count |
|---|---|
| ambition | 34 |
| base_year | 15 |
| horizon | 34 |
| metric_type | 34 |
| reduction_pct | 14 |
| scopes_covered | 34 |
| target_year | 29 |
| target_value | 4 |
| unit | 4 |
Table 2: breakdown of manually annotated experimental corpus by field
As Table 2 shows, not every parameter is equally likely to be included for a target, which highlights the variability in their reporting.
Experimental process & metrics #
Having created an evaluation corpus, I made a Python notebook (linked below) to conduct the experiment. It works by instructing the LLMs to generate targets for each given company and reporting year according to the structured format. These generated targets are then compared to the corresponding “true” evaluation targets, and an experimental metric is calculated to convey how well the LLM performed.
I began by considering how best to judge agreement between the LLM output and the evaluation targets, and how to convey that agreement through an experimental metric. I looked to some experimental literature for inspiration, such as the Ragas docs. I had already defined targets as composed of individual discrete fields with specific values that can be either correct or incorrect. Therefore, I could base agreement on a TRUE or FALSE comparison using an LLM-as-judge approach.
Each matched field between the output and evaluation targets counts as a true positive (TP), each unmatched prediction field as a false positive (FP), and each unmatched evaluation field as a false negative (FN). Based on this, and keeping things simple, precision and recall seemed entirely appropriate and straightforward; other options offered no advantage and seemed unnecessarily complex.
These “micro” precision, recall and F1 score (the harmonic mean of precision and recall) calculations measure how many correct target fields we generated, and how many we missed or hallucinated. The TP, FP and FN counts are pooled across all company-year documents (micro-averaging). In aggregate:
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 = 2TP / (2TP + FP + FN)
In addition to these, I explicitly added a hallucination rate, defined as the proportion of false predictions against total predictions (false assertions):
- Hallucination rate = FP / (TP + FP)
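A minimal sketch of these aggregate calculations from the pooled counts (the function name and inputs are illustrative, not taken from the notebook):

```python
def micro_metrics(tp: int, fp: int, fn: int) -> dict:
    """Micro-averaged metrics from TP/FP/FN counts pooled across all documents."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    hallucination_rate = fp / (tp + fp) if (tp + fp) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hallucination_rate": hallucination_rate,
    }
```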
The code also includes an implementation of a secondary “macro” accuracy metric, which looks only at matched pairs and scores their correctness. An LLM judges the quality of each match as one of (EXACT, PARTIAL, WRONG), with corresponding scores of (1.0, 0.5, 0); per-document field averages are then averaged across documents. This was included purely as a supporting metric and is very much secondary in the analysis, particularly as it does not reflect recall.
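A sketch of that macro scoring step, assuming the judge’s labels are already available per matched field (the names here are illustrative):

```python
from statistics import mean

# Map the judge's verdicts onto numeric scores.
JUDGE_SCORES = {"EXACT": 1.0, "PARTIAL": 0.5, "WRONG": 0.0}

def macro_accuracy(judged_docs: list[list[str]]) -> float:
    """Average per-document field scores, then average across documents.

    judged_docs: one inner list of judge labels per company-year document,
    covering only the matched fields (unmatched fields are ignored here).
    """
    per_doc = [mean(JUDGE_SCORES[label] for label in labels)
               for labels in judged_docs if labels]
    return mean(per_doc) if per_doc else 0.0
```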
The code in the notebook orchestrates all of the above. I ran it five times per LLM variant to reveal any instability in the LLM-as-a-judge mechanism.
Repository #
I set up the following repo for the code, results and other documentation:
https://github.com/y-f-a/climate-target-extraction
I will continually update this with related work in the future.
Results #
Table 3 shows that all three models performed similarly conservatively, with high precision and low recall:
| Model | Micro precision (mean ± std) | Micro recall (mean ± std) | Micro F1 (mean ± std) | Hallucination rate (mean ± std) | n |
|---|---|---|---|---|---|
| GPT-5 | 0.8763 ± 0.0939 | 0.3143 ± 0.0853 | 0.4588 ± 0.0989 | 0.1237 ± 0.0939 | 5 |
| GPT-5.1 | 0.9679 ± 0.0439 | 0.3786 ± 0.0741 | 0.5401 ± 0.0759 | 0.0321 ± 0.0439 | 5 |
| GPT-5.2 | 0.9464 ± 0.0496 | 0.3571 ± 0.0437 | 0.5169 ± 0.0435 | 0.0536 ± 0.0496 | 5 |
Table 3: performance metrics per OpenAI model.
This means the models avoided making claims unless they were fairly confident, which follows the conditions in the prompt. I did not vary the prompt, as that would require another set of optimisation iterations.
Also note that the standard deviation values represent the noise/instability in the “LLM-as-a-judge” and not model stochasticity.
GPT-5 is both more willing to assert incorrect values and no better at covering the true items. GPT-5.1 is the clear leader on these metrics, with the highest precision and recall, while GPT-5.2 is a close second.
However, I would treat this ordering as tentative until there are more runs, since the differences are based on only n=5 with noticeable standard deviation.
Lastly, I used the previously-defined macro metrics purely for diagnostics. All of the produced experiment data is included in the provided repo.
Next Steps #
The next stage of inquiry will involve repeating this task with a baseline RAG approach to get a measure of its relative performance.