Analyzing OPEC Member Crude Oil Production Quotas

Will Rodman | Data Science | Tulane University

Project Link: https://willcrodman.github.io

Introduction

This project explores crude oil data for OPEC (Organization of the Petroleum Exporting Countries) and OECD (Organization for Economic Co-operation and Development) countries from 1960 to 2022. It also provides political context for both organizations, exploring their respective roles and significance in the global oil trade. The data comes from the official OPEC website and includes crude oil production, demand, spot prices, refinery throughput, refinery capacity, and OPEC production quotas by country.

Founded in 1960, OPEC is a group of petroleum-exporting nations created with the primary aim of asserting collective control over its members' oil resources and the global oil trade. Its founding members were Iran, Iraq, Kuwait, Saudi Arabia, and Venezuela. Over the years, the organization has expanded to include several other member nations, totaling 13.

It is important to focus on OPEC member nations because of their global importance as petroleum exporters. OECD countries, which rely on oil imports to meet their energy needs, have historically been affected by changes in OPEC production quotas and by embargoes. By fixing production among its suppliers, OPEC behaves like a cartel-style supplier.

Data Source: https://asb.opec.org/data/ASB_Data.php

OPEC and OECD Organizations

OPEC consists of major oil-producing nations such as Saudi Arabia, Iran, Iraq, and Venezuela, among others, while the OECD comprises economically advanced nations, primarily in Europe and North America, which rely heavily on oil imports to fulfill their energy requirements. The OECD currently has 38 member countries, including the United States and the United Kingdom.

The primary difference between OPEC and the free-market nations of the OECD is that OPEC acts as a cartel-style supplier. In economics, a cartel is defined as "a formal agreement between a group of producers of a good or service to control supply or to regulate or manipulate prices" (investopedia.com).

Crude Oil Production and Demand

Extract, Transform, and Load (ETL) Data

This step imports crude oil production and demand for countries worldwide (including all OPEC members) from 1960 to 2020. Crude oil refers to the fossil fuel that exists in Earth's geological formations. Both datasets are stored in the project as Pandas DataFrames, along with a third DataFrame that describes each country's domestic crude oil demand deficit per year.

$\text{Deficit}_{\text{year}} = \text{Consumption}_{\text{year}} - \text{Production}_{\text{year}}$

Because the two datasets do not match one-to-one, deficits cannot be computed for every country. This means the exploratory data analysis covers all OPEC member countries but only a limited number of OECD member countries. In addition, here are the missing years for each country.
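The matching step described above can be sketched in plain Python (no Pandas, so it runs without dependencies). This is a minimal illustration, not the project's actual code: the country names, years, and values are made up, and `matched_keys` and `missing_years` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical toy data keyed by (country, year); values are illustrative only.
production = {("A", 2000): 10.0, ("A", 2001): 11.0, ("B", 2000): 5.0}
demand = {("A", 2000): 4.0, ("B", 2000): 7.0, ("B", 2001): 7.5}


def matched_keys(left, right):
    """Return the sorted (country, year) pairs present in both datasets."""
    return sorted(left.keys() & right.keys())


def missing_years(left, right):
    """Map each country to the sorted years found in only one dataset."""
    gaps = defaultdict(list)
    for country, year in sorted(left.keys() ^ right.keys()):
        gaps[country].append(year)
    return dict(gaps)


# Only these pairs have both a production and a demand value.
usable = matched_keys(production, demand)
# These years cannot contribute a deficit for the given country.
gaps = missing_years(production, demand)
print(usable)
print(gaps)
```

A deficit can only be computed for the pairs in `usable`; every year listed in `gaps` is missing from one of the two datasets.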

Exploratory Data Analysis (EDA)

This analysis visualizes production and demand over time using a line plot and a box plot. The visualizations are categorized by:

  1. The top five oil producers in 2023.
  2. OPEC member countries.
  3. OECD member countries with accessible data.

Box Plot Distribution

Local Image

The visualization shows that OPEC countries have a negative crude oil deficit, while the median OECD country in the sample has a positive crude oil deficit. It also shows how large economies such as the United States (OECD member), Canada (OECD member), Saudi Arabia (OPEC member), and Russia (OPEC+ member) can influence the distributions.

Oil Refinery Load and Capacity

Extract, Transform, and Load (ETL) Data

This step imports crude oil refinery throughput and capacities for countries worldwide (including all OPEC members) from 1980 to 2022. An oil refinery processes crude oil into various refined products, including gasoline, diesel, and petrochemicals. Both datasets are stored in the project as Pandas DataFrames, along with a third DataFrame that describes each country's oil refinery utilization rate per year.

$\text{Utilization}_{\text{year}} = \frac{\text{Refinery Throughput}_{\text{year}}}{\text{Refinery Capacity}_{\text{year}}}$

Because the second dataset is larger than the first, the third dataset does not have a perfect one-to-one matching. In addition, here are the missing years for each country.
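As a rough sketch of the utilization formula, the plain-Python snippet below divides throughput by capacity only where both values exist. The data is invented for illustration, and `utilization_rates` is a hypothetical helper name rather than the project's own function.

```python
# Hypothetical toy data keyed by (country, year), in thousand barrels per day.
throughput = {("A", 2000): 450.0, ("A", 2001): 520.0, ("B", 2000): 300.0}
capacity = {("A", 2000): 500.0, ("A", 2001): 500.0, ("B", 2001): 400.0}


def utilization_rates(throughput, capacity):
    """Compute throughput / capacity for keys present in both inputs.

    Pairs with a non-positive capacity are skipped rather than guessed.
    """
    rates = {}
    for key in sorted(throughput.keys() & capacity.keys()):
        if capacity[key] > 0:
            rates[key] = throughput[key] / capacity[key]
    return rates


rates = utilization_rates(throughput, capacity)
# Rates above 1.0 (over 100%) are worth flagging for a data check.
suspicious = [key for key, rate in rates.items() if rate > 1.0]
```

Flagging rates above 1.0 is a cheap sanity check, since a refinery reported as running above its stated capacity usually points to mismatched units or reporting periods.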

OPEC countries show more variance in their refinery utilization than OECD countries. One possible factor is that OPEC sets production quotas for member countries, which can prevent members from producing crude oil and running refineries at their optimal domestic rate.

Crude Oil Spot Prices

Extract, Transform, and Load (ETL) Data

This step imports crude oil spot prices for countries worldwide (including all OPEC members) from 1983 to 2022. The spot price is the current market price at which crude oil can be bought or sold for immediate delivery; it serves as the reference for pricing futures contracts, which is how most oil is traded.

Here are the missing years for each country and benchmark.

Exploratory Data Analysis (EDA)

Oil spot prices are categorized by country and benchmark; a country can have multiple benchmarks, and countries can track each other's benchmarks. To analyze how spot price benchmarks vary from each other, I normalize the spot prices using the z-score and then compute a 5-year moving average. This is done for OPEC and OECD countries where spot price data was available.
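The normalization and smoothing steps can be sketched as follows. This is a minimal plain-Python version under two assumptions not stated in the source: the z-score uses the population standard deviation, and the moving average is a trailing window (the current year plus the four before it). The prices are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined for a constant series")
    return [(v - mu) / sigma for v in values]


def moving_average(values, window=5):
    """Trailing moving average; the first window - 1 positions are None."""
    out = [None] * min(window - 1, len(values))
    for end in range(window, len(values) + 1):
        out.append(mean(values[end - window:end]))
    return out


# Hypothetical annual spot prices for one benchmark (USD per barrel).
prices = [20.0, 22.0, 18.0, 25.0, 30.0, 28.0, 35.0]
normalized = z_scores(prices)
smoothed = moving_average(normalized, window=5)
```

Normalizing each benchmark first puts all series on the same scale, so the smoothed curves can be compared directly even when the benchmarks trade at different price levels.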

Observations from the spot price visualizations:

OPEC Crude Oil Production Quotas

OPEC quotas are production limits established for member countries to regulate the global supply of oil, which is how OPEC acts as a cartel supplier in the economic sense. These quotas are set during OPEC meetings, where member nations agree on individual production targets; the arrangement relies on trusting member countries to comply. Historically, due to geopolitical conflict and economic collapse, member countries have repeatedly underproduced or overproduced relative to their targets.

Extract, Transform, and Load (ETL) Data

Because OPEC meetings do not happen at scheduled times, we have to transform the quota data into annual sums.
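One possible reading of this transformation is sketched below in plain Python. The record layout, values, and the helper name `annual_quota_sums` are assumptions for illustration; the source does not show how quota values are scaled before summing.

```python
from collections import defaultdict
from datetime import date

# Hypothetical quota decisions: (country, effective date, quota allocation).
decisions = [
    ("A", date(1990, 3, 1), 100.0),
    ("A", date(1990, 9, 15), 120.0),
    ("A", date(1992, 1, 10), 130.0),
    ("B", date(1990, 3, 1), 50.0),
]


def annual_quota_sums(records):
    """Sum quota values per (country, calendar year)."""
    totals = defaultdict(float)
    for country, effective, quota in records:
        totals[(country, effective.year)] += quota
    return dict(totals)


annual = annual_quota_sums(decisions)
# Years with no meeting (1991 for country "A" here) simply have no entry.
```

Grouping by calendar year leaves gaps wherever no quota decision fell in that year, which is one way missing values can arise in the transformed data.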

Exploratory Data Analysis (EDA)

There are many missing values in the transformed dataset. Considering this, we will only plot the two nations with the most data available: Saudi Arabia and the United Arab Emirates.

Model: Random Forest Regression of Overproduction

Before model training, I note the following data errors and observations based on ETL:

  1. All nations produce a negative deficit (positive surplus) of crude oil.
  2. Some nations have over 100% utilization rate; this could be an error in my feature engineering.
  3. Excluding the UAE and Saudi Arabia, the majority of OPEC nations are missing over 50% of their quota data.

Having tested LinearRegression, PolynomialFeatures, DecisionTree, and RandomForest regression models, I found that the RandomForest regression model produces the best test accuracy.

Assumptions followed for RandomForest regression model:

  1. There is little to no autocorrelation (correlation over time) in the dataset.
  2. Calculation of production quota by year is correct.
  3. All input features are continuous features.

The regression model predicts the overproduction of crude oil by OPEC nations. The dependent variable is the percentage of overproduction, and the independent variables are refinery throughput, domestic demand, and spot price premium (the premium above the OPEC Oil Reference Basket). The training and test datasets combine only annual observations from Saudi Arabia and the United Arab Emirates, because these two nations are the largest crude oil producers in OPEC and are the only ones with data across all six ETL datasets.

Letting overproduction be a percentage:

$\text{Overproduction}_{\text{year}} = \frac{\text{Production}_{\text{year}} - \text{Target}_{\text{year}}}{\text{Production}_{\text{year}}}$

The regression model is defined as:

$\widehat{\text{Overproduction}}_{\text{year}} = \frac{1}{100} \sum_{i=1}^{100} h_i(x)$

where $100$ is the number of decision trees and $h_i(x)$ is the prediction of the $i$-th decision tree.
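To make the averaging step concrete, here is a minimal plain-Python sketch. The three lambda functions are hypothetical stand-ins for fitted decision trees (a real forest would learn its splits from data), and `forest_predict` is an illustrative name, not a library call.

```python
from statistics import mean


def forest_predict(trees, x):
    """Average the predictions of every tree for a single input x."""
    return mean(tree(x) for tree in trees)


# Hypothetical stand-ins for fitted trees: each maps a feature dict to a value.
trees = [
    lambda x: 0.10 if x["demand"] > 50 else -0.05,
    lambda x: 0.20 if x["premium"] > 0 else 0.00,
    lambda x: 0.06,
]

prediction = forest_predict(
    trees, {"demand": 60, "throughput": 40, "premium": -0.01}
)
```

With 100 trees instead of three, the same function computes the sum in the formula divided by 100.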

Creating this model also makes it possible to determine which input features were most influential in predicting overproduction.

Combining Datasets for Machine Learning

Correlation and Shape of Dataset

Training Random Forest Model

Based on the correlation plots, domestic demand, refinery throughput, and premium appear to have the lowest correlation with one another across the feature space. These three features are the inputs used to predict overproduction.

Input descriptions:

  1. demand: An OPEC nation's domestic demand for crude oil.
  2. throughput: The amount of crude oil refined into final products domestically by an OPEC nation.
  3. premium: The percentage by which a nation's crude oil spot price is above (or below) the OPEC Oil Reference Basket.

Measuring the Model's Accuracy

Because the model is predicting a percentage value, accuracy will be measured using Mean Absolute Error. This way our model output and error are both percentage values.

$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \text{Overproduction}_i - \widehat{\text{Overproduction}}_i \right|$
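The MAE formula translates directly into a few lines of plain Python. The values below are invented for illustration and are not results from the project.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical overproduction values expressed as fractions (0.05 = 5%).
actual = [0.05, -0.02, 0.10, 0.00]
predicted = [0.03, 0.01, 0.06, -0.02]
mae = mean_absolute_error(actual, predicted)
```

Because the inputs are fractions of production, the resulting error is in the same unit and can be read as percentage points once multiplied by 100.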

The model is not very accurate at predicting the overproduction percentage. As a simpler measure, let's evaluate the model's precision at correctly predicting positive overproduction values.

The model is more precise at correctly predicting when overproduction occurs.
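The source does not show how precision was computed, so the sketch below assumes the standard classification definition: threshold both actual and predicted values at zero, then divide true positives by all predicted positives. The data and the name `positive_precision` are hypothetical.

```python
def positive_precision(actual, predicted):
    """Share of predicted-positive cases whose actual value is also positive."""
    flagged = [a for a, p in zip(actual, predicted) if p > 0]
    if not flagged:
        return None  # precision is undefined when nothing is predicted positive
    return sum(1 for a in flagged if a > 0) / len(flagged)


# Hypothetical overproduction values expressed as fractions (0.05 = 5%).
actual = [0.05, -0.02, 0.10, 0.00, 0.04]
predicted = [0.03, 0.01, 0.06, -0.02, 0.02]
precision = positive_precision(actual, predicted)
```

This reduces the regression output to a sign check, which is why a model with a large MAE can still score well on it.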

Determining Feature Importance

Finally, the model can be used to infer which features are the most important for predicting overproduction.

This shows that the selected features ended up having nearly equal importance in fitting the model, with each feature holding approximately one third of the weight.