This case study explores the relationship between three major field crops (Soybeans, Corn, and Wheat) and climate factors, specifically average rainfall and temperature in New Jersey from 2010 to 2024.
The objective is to analyze whether year-to-year changes in rainfall or temperature influence crop yields, and to visualize trends and relationships across time.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
crops <- read.csv("C://Users//micah//OneDrive//Documents//soybean_case_study//final_cleaned//crops_cleaned")
rain <- read.csv("C://Users//micah//OneDrive//Documents//soybean_case_study//final_cleaned//total_p")
temp <- read.csv("C://Users//micah//OneDrive//Documents//soybean_case_study//final_cleaned//total_t")
The datasets used in this case study were sourced from USDA/NASS (crop yields) and NOAA (rainfall and temperature). Multiple tools were used to clean and prepare the data for analysis:
The temperature and precipitation data were originally provided as 14 separate data sets for each year, each containing unneeded header information in the top three rows. To efficiently clean all 28 datasets, a custom Google Sheets script was used in the Script Editor.
The script looped through each dataset and removed the top three rows, ensuring that only relevant data remained. This approach allowed for consistent, reproducible cleaning across all files and facilitated smooth aggregation in BigQuery.
Example snippet of the script:
function removeTopThreeRowsFromAllSheets() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheets = ss.getSheets();
for (var i = 0; i < sheets.length; i++) {
var sheet = sheets[i];
if (sheet.getLastRow() > 3) {
sheet.deleteRows(1, 3);
}
}
}
After removing unneeded rows, the next step was to ensure that all datasets had clean, consistent column headers suitable for BigQuerys standards.
To do this:
This process standardized all column names, making the data ready for SQL queries and further aggregation in BigQuery.
=LOWER(REGEXREPLACE(REGEXREPLACE(A1,"[^A-Za-z0-9]","_"),"^([0-9])","col_$1"))
Each climate dataset (temperature and precipitation) initially contained only county-level values without a year column. To allow aggregation across multiple years, I added a year column to each table using SQL in BigQuery.
For example, for the 2010 precipitation dataset:
ALTER TABLE `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.p_2010`
ADD COLUMN year INT64;
UPDATE `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.p_2010`
SET year = 2010
WHERE TRUE;
Once each precipitation and temperature table contained a
year
column, the next step was to combine all
yearly tables into single aggregated tables for analysis.
In BigQuery, this was accomplished using the UNION ALL
operator, which stacks rows from multiple tables while preserving all
values. For example, for precipitation datasets:
CREATE TABLE `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.total_p` AS
SELECT * FROM `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.p_2010`
UNION ALL
SELECT * FROM `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.p_2011`
UNION ALL
SELECT * FROM `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.p_2012`
-- (continued for all years through 2024)
To prepare the crop yield dataset for analysis, it was necessary to calculate the total bushels per acre for each crop by year.
In BigQuery, this was accomplished using a SELECT
query
with aggregation:
SELECT
year,
state,
commodity,
SUM(SAFE_CAST(value AS INT64)) AS total_value
FROM `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.crops`
WHERE data_item LIKE '%BU%'
GROUP BY year, state, commodity
ORDER BY year, commodity;
After cleaning and aggregating the data in BigQuery, the next step was to visualize trends in crop yields over time using RStudio and ggplot2.
ggplot(crops, aes(x = year, y = total_value, color = commodity)) +
geom_line(size = 1.2) +
labs(
title = "Crop Yield Trends in NJ (2010-2024)",
x = "Year",
y = "Yield (BU/acre)",
caption = "Source: USDA/NASS & NOAA"
) +
theme_minimal()
To understand how climate factors may influence crop yields, I examined average precipitation and average temperature against crop yields for Soybeans, Corn, and Wheat from 2010–2024.
First, state-level averages were calculated for each year:
rain_year <- rain %>%
group_by(year) %>%
summarise(rainfall = mean(value, na.rm = TRUE))
temp_year <- temp %>%
group_by(year) %>%
summarise(temperature = mean(value, na.rm = TRUE))
combined <- crops %>%
left_join(rain_year, by = "year") %>%
left_join(temp_year, by = "year")
ggplot(combined, aes(x = rainfall, y = total_value, color = commodity)) +
geom_point(size = 2) +
labs(
title = "Relationship between Rainfall and Crop Yields",
x = "Average Precipitation (inches)",
y = "Yield (BU/acre)",
caption = "Source: USDA/NASS & NOAA"
) +
theme_minimal()
ggplot(combined, aes(x = temperature, y = total_value, color = commodity)) +
geom_point(size = 2) +
labs(
title = "Relationship between Temperature and Crop Yields",
x = "Average Temperature (F)",
y = "Yield (BU/acre)",
caption = "Source: USDA/NASS & NOAA"
) +
theme_minimal()
To understand how crop yields change annually, I calculated year-over-year changes for each crop, as well as changes in rainfall and temperature.
combined <- combined %>%
arrange(commodity, year) %>% # Sort by crop and year
group_by(commodity) %>%
mutate(
yield_change = total_value - lag(total_value),
rainfall_change = rainfall - lag(rainfall),
temp_change = temperature - lag(temperature)
) %>%
ungroup()
ggplot(combined, aes(x = year, y = yield_change, color = commodity)) +
geom_line(size = 1.2) +
geom_point(size = 2) +
labs(
title = "Year-to-Year Change in Crop Yields",
x = "Year",
y = "Change in Yield (BU/acre)",
color = "Crop",
caption = "Source: USDA/NASS & NOAA"
) +
theme_minimal()
To clearly display the annual changes in crop yields and compare commodities, a dodge bar plot was created using ggplot2:
ggplot(combined, aes(x = factor(year), y = yield_change, fill = commodity)) +
geom_col(position = "dodge") +
labs(
title = "Yearly Change in Crop Yields",
x = "Year",
y = "Yield Change (BU/acre)",
caption = "Source: USDA/NASS & NOAA"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This project was an opportunity to take messy, fragmented datasets from multiple sources (USDA/NASS and NOAA) and transform them into a structured, analyzable format. I built custom cleaning scripts in Google Sheets, standardized headers for BigQuery, aggregated data with SQL, and then performed exploratory analysis and visualization in R.
From the analysis itself, I found that while New Jersey crop yields fluctuate year to year, neither rainfall nor temperature alone strongly explains yield variation. Soybeans and wheat showed slight positive associations with warmer years, while corn yields were far more volatile, likely influenced by additional factors such as soil quality, irrigation, or technology.
More importantly, this case study demonstrates how I approach complex problems as a data analyst:
While the hypothesis wasn’t strongly confirmed, this project highlights my ability to manage the full data pipeline, from raw collection to final insights. It shows how I can bring structure, clarity, and analytical thought process to complex datasets.