case_study_crops.knit

1 Introduction

This case study explores the relationship between three major field crops (Soybeans, Corn, and Wheat) and climate factors, specifically average rainfall and temperature in New Jersey from 2010 to 2024.

The objective is to analyze whether year-to-year changes in rainfall or temperature influence crop yields, and to visualize trends and relationships across time.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)

crops <- read.csv("C://Users//micah//OneDrive//Documents//soybean_case_study//final_cleaned//crops_cleaned")
rain  <- read.csv("C://Users//micah//OneDrive//Documents//soybean_case_study//final_cleaned//total_p")
temp  <- read.csv("C://Users//micah//OneDrive//Documents//soybean_case_study//final_cleaned//total_t")

2 Data Cleaning and Preparation

The datasets used in this case study were sourced from USDA/NASS (crop yields) and NOAA (rainfall and temperature). Multiple tools were used to clean and prepare the data for analysis:

2.1 Google Sheets Script Editor

The temperature and precipitation data were originally provided as 14 separate data sets for each year, each containing unneeded header information in the top three rows. To efficiently clean all 28 datasets, a custom Google Sheets script was used in the Script Editor.

The script looped through each dataset and removed the top three rows, ensuring that only relevant data remained. This approach allowed for consistent, reproducible cleaning across all files and facilitated smooth aggregation in BigQuery.

Example snippet of the script:

function removeTopThreeRowsFromAllSheets() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheets = ss.getSheets();
    for (var i = 0; i < sheets.length; i++) {
    var sheet = sheets[i];
        if (sheet.getLastRow() > 3) {
      sheet.deleteRows(1, 3);
    }
  }
}

2.2 Preparing Headers for BigQuery

After removing unneeded rows, the next step was to ensure that all datasets had clean, consistent column headers suitable for BigQuerys standards.

To do this:

A blank second row was added to each sheet.
A custom Google Sheets function converted the row above into a valid header format for BigQuery. This included replacing spaces or special characters with underscores, standardizing capitalization, and ensuring no empty column names.
The functions output was copied and pasted as values into the header row, and the temporary second row was deleted.

This process standardized all column names, making the data ready for SQL queries and further aggregation in BigQuery.

=LOWER(REGEXREPLACE(REGEXREPLACE(A1,"[^A-Za-z0-9]","_"),"^([0-9])","col_$1"))

2.3 Adding Year Columns in BigQuery

Each climate dataset (temperature and precipitation) initially contained only county-level values without a year column. To allow aggregation across multiple years, I added a year column to each table using SQL in BigQuery.

For example, for the 2010 precipitation dataset:

ALTER TABLE `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.p_2010`
ADD COLUMN year INT64;

UPDATE `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.p_2010`
SET year = 2010
WHERE TRUE;

2.4 Aggregating Climate Data Across Years

Once each precipitation and temperature table contained a year column, the next step was to combine all yearly tables into single aggregated tables for analysis.

In BigQuery, this was accomplished using the UNION ALL operator, which stacks rows from multiple tables while preserving all values. For example, for precipitation datasets:

CREATE TABLE `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.total_p` AS
  SELECT * FROM `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.p_2010`
  UNION ALL
  SELECT * FROM `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.p_2011`
  UNION ALL
  SELECT * FROM `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.p_2012`
  -- (continued for all years through 2024)

2.5 Aggregating Crop Yield Data in BigQuery

To prepare the crop yield dataset for analysis, it was necessary to calculate the total bushels per acre for each crop by year.

In BigQuery, this was accomplished using a SELECT query with aggregation:

SELECT
  year,
  state,
  commodity,
  SUM(SAFE_CAST(value AS INT64)) AS total_value
FROM `plucky-respect-467323-d9.NJ_agiculture_casestudy_2025.crops`
WHERE data_item LIKE '%BU%'
GROUP BY year, state, commodity
ORDER BY year, commodity;

3 Visualizing Crop Yield Trends in RStudio

After cleaning and aggregating the data in BigQuery, the next step was to visualize trends in crop yields over time using RStudio and ggplot2.

ggplot(crops, aes(x = year, y = total_value, color = commodity)) +
  geom_line(size = 1.2) +
  labs(
    title = "Crop Yield Trends in NJ (2010-2024)",
    x = "Year",
    y = "Yield (BU/acre)",
    caption = "Source: USDA/NASS & NOAA"
  ) +
  theme_minimal()

Corn shows unusually high yield values from 2010–2015, peaking above 800 BU/acre, then dropping sharply after 2015 with more volatility
Soybeans have more consistent yields, generally ranging between 150–250 BU/acre, with modest fluctuations

4 Exploring the Relationship Between Climate Variables and Crop Yields

To understand how climate factors may influence crop yields, I examined average precipitation and average temperature against crop yields for Soybeans, Corn, and Wheat from 2010–2024.

4.1 Preparing Climate Data

First, state-level averages were calculated for each year:

rain_year <- rain %>%
  group_by(year) %>%
  summarise(rainfall = mean(value, na.rm = TRUE))

temp_year <- temp %>%
  group_by(year) %>%
  summarise(temperature = mean(value, na.rm = TRUE))

combined <- crops %>%
  left_join(rain_year, by = "year") %>%
  left_join(temp_year, by = "year")

ggplot(combined, aes(x = rainfall, y = total_value, color = commodity)) +
  geom_point(size = 2) +
  labs(
    title = "Relationship between Rainfall and Crop Yields",
    x = "Average Precipitation (inches)",
    y = "Yield (BU/acre)",
    caption = "Source: USDA/NASS & NOAA"
  ) +
  theme_minimal()

No strong linear correlation is immediately clear between precipitation and yields
Corn has extreme outliers even at average rainfall levels

ggplot(combined, aes(x = temperature, y = total_value, color = commodity)) +
  geom_point(size = 2) +
  labs(
    title = "Relationship between Temperature and Crop Yields",
    x = "Average Temperature (F)",
    y = "Yield (BU/acre)",
    caption = "Source: USDA/NASS & NOAA"
  ) +
  theme_minimal()

Corn again shows some very high yields across the full temperature range
Soybeans and wheat show a slight upward trend at higher temperatures, yields cluster higher around 55°F compared to 52–53°F
This could suggest moderate warming is favorable for these crops in NJ

5 Year-to-Year Changes in Crop Yields

To understand how crop yields change annually, I calculated year-over-year changes for each crop, as well as changes in rainfall and temperature.

combined <- combined %>%
  arrange(commodity, year) %>%   # Sort by crop and year
  group_by(commodity) %>%
  mutate(
    yield_change = total_value - lag(total_value),
    rainfall_change = rainfall - lag(rainfall),
    temp_change = temperature - lag(temperature)
  ) %>%
  ungroup()

ggplot(combined, aes(x = year, y = yield_change, color = commodity)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  labs(
    title = "Year-to-Year Change in Crop Yields",
    x = "Year",
    y = "Change in Yield (BU/acre)",
    color = "Crop",
    caption = "Source: USDA/NASS & NOAA"
  ) +
  theme_minimal()

The line chart highlights corn’s fluctuations, where highs and lows dominate overall yield variability

6 Visualizing Yearly Crop Yield Changes

To clearly display the annual changes in crop yields and compare commodities, a dodge bar plot was created using ggplot2:

ggplot(combined, aes(x = factor(year), y = yield_change, fill = commodity)) +
  geom_col(position = "dodge") +
  labs(
    title = "Yearly Change in Crop Yields",
    x = "Year",
    y = "Yield Change (BU/acre)",
    caption = "Source: USDA/NASS & NOAA"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Soybean and wheat yields remain relatively more stable, with changes generally staying within ±150 bushels/acre

7 Conclusion

This project was an opportunity to take messy, fragmented datasets from multiple sources (USDA/NASS and NOAA) and transform them into a structured, analyzable format. I built custom cleaning scripts in Google Sheets, standardized headers for BigQuery, aggregated data with SQL, and then performed exploratory analysis and visualization in R.

From the analysis itself, I found that while New Jersey crop yields fluctuate year to year, neither rainfall nor temperature alone strongly explains yield variation. Soybeans and wheat showed slight positive associations with warmer years, while corn yields were far more volatile, likely influenced by additional factors such as soil quality, irrigation, or technology.

More importantly, this case study demonstrates how I approach complex problems as a data analyst:

Cleaning & transforming data at scale (28+ raw climate files into consistent, query-ready tables)
Building reproducible workflows across SQL and RStudio
Testing hypotheses with visualizations to interpret relationships between variables
Communicating insights in a way that connects raw data to real world context

While the hypothesis wasn’t strongly confirmed, this project highlights my ability to manage the full data pipeline, from raw collection to final insights. It shows how I can bring structure, clarity, and analytical thought process to complex datasets.