Exploring Alberta Public Data


There is a really interesting data set available here, which contains all of the oil & gas production in Alberta. There are a few cool ideas for what to do with this data:
  • Join the volumetric flow to the pipelines, and create a map of pipeline flows
  • Join the volumetric flow to the wells, and create a map of the well flows
  • Find the price sensitivity/intercept for each producer and facility
  • Create network diagrams illustrating how the gas flows
  • Find the producers who react the most to market changes, and who have the highest-producing wells
If you don't have time to read this but just want to see the network diagrams, click here. Looking through the page where the data is available, there are a few files to choose from: monthly production data going back to 2015, the well licence file, and the well infrastructure file. I found that I had to join the files in order to have enough information. Below is what I gathered from each file:
  • Well Licence File: Licence Number (used to join), Licensee Name (who, as far as we know, the production belongs to).
  • Well Infrastructure File: WellID (used to join), Well Name, Licence Number (used to join), Linked Facility ID (used to join), Linked Facility Name, Linked Facility Operator Name
  • Production Data: Lots of information about the facilities that the wells link to. In order to join the last two tables to this, we need to filter for the product we want (for example gas, or oil), and we only care about ActivityIDs of PROD. Once we apply those filters, we can join using the FromToID. If the ActivityID is PROD, the FromToID is a FromID.
If you want to follow along using R, then download all of those CSV files, and type the following:
library(dplyr)
library(ggplot2)
#Read the Well Infrastructure File for the Licensee, and WellID
inf <- read.csv("production/Well Infrastructure.CSV")
#Read the Well Licence file for the Licence Number and the WellID
wells <- read.csv("production/Well Licence.CSV")
#Take only the columns I need from the wells data frame
lic <- wells[,c(2,3,6,7,8)]
#Take the columns I need from the infrastructure data frame
id_map <- inf[,c(1,12,18,19,21,22,23,28)]
welldata <- left_join(id_map, lic, by = "LicenceNumber")
#Start with a single month to check the joins
month <- 1
production <- read.csv(paste("production/Vol_2018-", sprintf("%02d", month), ".CSV", sep = ""))
prod <- subset(production, ActivityID == "PROD")
prod$Volume <- as.numeric(as.character(prod$Volume))
prod$FromToID <- as.character(prod$FromToID)
#Gas only; keep the FromToID and Volume columns
gasprod <- subset(prod, ProductID == "GAS")
gasprod <- gasprod[,c(20,24)]
welldataprod <- left_join(welldata, gasprod, by = c("WellID" = "FromToID"))
welldataprod$Volume <- as.numeric(as.character(welldataprod$Volume))
gasops <- welldataprod[which(!is.na(welldataprod$Volume) & !is.na(welldataprod$LicenseeName)),] %>%
  group_by(LicenseeName) %>%
  summarise(volume = sum(Volume), number = n()) %>%
  ungroup()
gasops$avg <- gasops$volume / gasops$number
gasops$month <- month
The linked facility ID is not always present in the well infrastructure file, so we will lose some volumes because of this. Once I have joined all of those tables together, I can join them to the gas price in the province and look at the price sensitivity for each producer and each facility. I would expect the volume produced to be a function of the square root of the price; equivalently, volume squared is proportional to the price. Of the candidates I tried (price directly proportional to volume, price proportional to the square root of volume, and price proportional to volume squared), the lowest residual error came from volume squared proportional to price. To find out what it actually looks like, we will need to read in at least a year's worth of data, so I will make a for loop to rbind the monthly data into one big data frame.
#Initialise an empty data frame with the same columns (gasops[1,] would duplicate the first row of month 1)
finalgas <- gasops[0,]
for (month in 1:12) {
  production <- read.csv(paste("production/Vol_2018-", sprintf("%02d", month), ".CSV", sep = ""))
  prod <- subset(production, ActivityID == "PROD")
  prod$Volume <- as.numeric(as.character(prod$Volume))
  prod$FromToID <- as.character(prod$FromToID)
  #Gas only; keep the FromToID and Volume columns
  gasprod <- subset(prod, ProductID == "GAS")
  gasprod <- gasprod[,c(20,24)]
  welldataprod <- left_join(welldata, gasprod, by = c("WellID" = "FromToID"))
  welldataprod$Volume <- as.numeric(as.character(welldataprod$Volume))
  gasops <- welldataprod[which(!is.na(welldataprod$Volume) & !is.na(welldataprod$LicenseeName)),] %>%
    group_by(LicenseeName) %>%
    summarise(volume = sum(Volume), number = n()) %>%
    ungroup()
  gasops$avg <- gasops$volume / gasops$number
  gasops$month <- month
  finalgas <- rbind(finalgas, gasops)
}
Then get the price and join it.
#price is assumed to be a data frame with year, month and price columns, loaded separately
finalgas <- left_join(finalgas, price[which(price$year == 18),], by = "month")
Then loop through all the companies and fit a linear regression (y = mx + b) of price against volume squared for each one.
#Find the price sensitivity for each company
finalgas$sensitivity <- NA
finalgas$intercept <- NA
for (company in unique(finalgas$LicenseeName)) {
  model <- lm(price ~ I(volume^2), finalgas[which(finalgas$LicenseeName == company),])
  finalgas[which(finalgas$LicenseeName == company),]$sensitivity <- coef(model)[2]
  finalgas[which(finalgas$LicenseeName == company),]$intercept <- coef(model)[1]
}
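To illustrate how the three candidate relationships can be compared by residual error, here is a quick sketch on synthetic numbers (not the AER data), with `numpy.polyfit` standing in for R's `lm`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the finalgas data: 24 (volume, price) pairs generated
# from the post's preferred model, price = m*volume^2 + b, plus a little noise.
volume = rng.uniform(100, 1000, size=24)
price = 5e-6 * volume**2 + 1.0 + rng.normal(0, 0.05, size=24)

def residual_sse(x, y):
    """Least-squares fit of y = m*x + b; returns the sum of squared residuals."""
    m, b = np.polyfit(x, y, 1)
    return float(np.sum((y - (m * x + b)) ** 2))

# The three candidate relationships from the text, each fitted as a straight
# line after transforming the predictor.
candidates = {
    "price ~ volume": residual_sse(volume, price),
    "price ~ sqrt(volume)": residual_sse(np.sqrt(volume), price),
    "price ~ volume^2": residual_sse(volume**2, price),
}
best = min(candidates, key=candidates.get)
print(best)  # the volume^2 model should win on data generated this way
```

On data actually generated from a quadratic relationship, the volume² predictor gives the smallest residual sum of squares; running the same comparison on the real finalgas data is how you would check which form fits best.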
We can check that, overall, the model makes sense: summarise the finalgas data frame by month, and plot price vs volume.
bymonth <- finalgas %>% group_by(month) %>% summarise(volume = sum(volume), price = mean(price)) %>% ungroup()
ggplot(bymonth, aes(x = volume, y = price)) + geom_point()
Overall that looks similar to what was expected. Now you can sort, filter, and view the finalgas data frame to see at what price the facilities operate at their lowest flow. The most interesting of these facilities would be the gas storage facilities, as they would focus on the price, whereas other producers may pay no attention to it. You can see how reactive the facilities are to the price from the sensitivity variable. The product will often pass through several facilities, which could affect price sensitivity, so it would be useful to aggregate by the final facility. We can use a python script to chase from one facility to another and come up with the final destination facility (which I will call lastFacilityID) for each row (well) in our table. To do that, let's first output a list of the facilities we are interested in.
write.csv(welldataprod[,4], "facilitylist.csv", row.names=FALSE)
We should also output all of the production rows with the ActivityID "REC"; their FromToID column is a FromID when the ActivityID is REC.
received <- subset(prod, ActivityID == "REC" & ProductID == "GAS")
write.csv(received, "received.csv", row.names = FALSE)
Now we can jump over to python and write a small script to:
  • Create a dictionary recording where each facility sends its product; we will call this receivedDict.
  • Run through the list of facilities we are interested in (first-level facilities, from the facilitylist.csv file). For each facility, keep going to the next facility as defined by receivedDict until the current facility is not in receivedDict, then add that current facility to the finalFacDict dictionary.
  • Output a CSV of the last facilities, based on the final facility dictionary.
The code below does all of this:
#Read in the received file, which we will chase through to find the final facility.
#Column 5 (index 4) is the ReportingFacilityID; column 21 (index 20) is the FromToID.
#"","ProductionMonth","OperatorBAID","OperatorName","ReportingFacilityID","ReportingFacilityProvinceState","ReportingFacilityType","ReportingFacilityIdentifier","ReportingFacilityName","ReportingFacilitySubType","ReportingFacilitySubTypeDesc","ReportingFacilityLocation","FacilityLegalSubdivision","FacilitySection","FacilityTownship","FacilityRange","FacilityMeridian","SubmissionDate","ActivityID","ProductID","FromToID","FromToIDProvinceState","FromToIDType","FromToIDIdentifier","Volume","Energy","Hours","CCICode","ProrationProduct","ProrationFactor"
#"1317","2018-12","0007","IMPERIAL OIL RESOURCES LIMITED","ABBT0051211","AB","BT",51211,"IMPERIAL MASKWA BATTERY 10-12",344,"IN-SITU OIL SANDS","10-12-065-04W4",10,12,65,4,4,"2019-01-18","REC","GAS","ABGE","AB","GE","","272.2",NA,"",NA,"",NA
receivedDict = {}
f = open("received.csv", "r")
for line in f.readlines():
    #Naive split: assumes no quoted field contains a comma
    lov = line.split(",")
    if len(lov) > 20:
        fromFacility = lov[20].strip().strip('"').strip()
        currentFacility = lov[4].strip().strip('"').strip()
        receivedDict[fromFacility] = currentFacility
f.close()

#Now read the list of first level facilities from the first level summary file:
#"LinkedFacilityName","volume","number","avg","month"
#" ACL ANTE CREEK 06-09 OIL BATTERY",2165.2,7,309.314285714286,12
initialFacilities = []
f = open("facilitylist.csv", "r")
for line in f.readlines():
    if len(line) > 3:
        #Take the first column, which identifies the facility
        initialFacilities.append(line.split(",")[0].strip().strip('"'))
f.close()

finalFacDict = {}
facLinkDict = {}
reverseFacDict = {}
for facility in initialFacilities:
    currentFacility = facility
    facLinkDict[facility] = []
    count = 0
    while currentFacility in receivedDict:
        count += 1
        if count > 50:
            break  #guard against cycles in the received links
        if currentFacility == receivedDict[currentFacility]:
            break  #guard against self-loops
        currentFacility = receivedDict[currentFacility]
        facLinkDict[facility].append(currentFacility)
    finalFacDict[facility] = currentFacility
    if currentFacility not in reverseFacDict:
        reverseFacDict[currentFacility] = [facility]
    else:
        reverseFacDict[currentFacility].append(facility)

f = open("lastFacility.csv", "w")
for facility in finalFacDict.keys():
    f.write("{0},{1}\n".format(facility, finalFacDict[facility]))
f.close()
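The chasing loop is easier to trust once it has been exercised on a toy receivedDict. The facility IDs below are made up, and `chase` re-implements the while loop above as a function, including the hop cap that guards against cycles:

```python
# Toy version of the chasing loop, with made-up facility IDs; X and Y form a
# cycle, which the hop cap is there to break.
receivedDict = {"A": "B", "B": "C", "C": "D", "X": "Y", "Y": "X"}

def chase(facility, received, max_hops=50):
    """Follow received links until reaching a facility that sends to nobody."""
    current = facility
    hops = 0
    while current in received:
        hops += 1
        if hops > max_hops:
            break  # cycle guard, as in the script above
        if current == received[current]:
            break  # self-loop guard
        current = received[current]
    return current

print(chase("A", receivedDict))  # D: the chain A -> B -> C -> D ends at D
print(chase("X", receivedDict))  # the hop cap stops the X -> Y -> X cycle
```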
    
Now we are ready to read that into R and join it up to the finalgas data frame. The way I did it is not very R-like, and it's slow, but it works.
#Read the facility mapping produced by the python script
lastFacility <- read.csv("lastFacility.csv", header = FALSE, col.names = c("X", "X.1"))
finalgas$firstOperator <- NA
finalgas$lastFacility <- NA
for (facility in unique(finalgas$LinkedFacilityName)) {
  rows <- which(finalgas$LinkedFacilityName == facility)
  finalgas[rows,]$firstOperator <- as.character(welldata[which(welldata$LinkedFacilityName == facility)[1],]$LinkedFacilityOperatorLegalName)
  finalgas[rows,]$lastFacility <- as.character(lastFacility[which(lastFacility$X == facility)[1],]$X.1)
}
    

This blog post is getting a bit long; now that you can reproduce the source data, I will share a few visualisation results.

I created a python script to add all the well flows onto the pipelines on the way between each well and its final facility. This of course took a very long time (around 40 hours). The only common variable between the pipeline shapefile and the well production data is the surface location, so I loaded every pipeline with its start/end surface locations, traced every connection, and only added the volumes to pipelines on routes which reach the final destination. Even doing this, there are still errors, because there are often multiple pipeline routes through the same surface locations. Until there are more fields to join by, I don't think this is accurate enough. Below is the result. I also joined the well production into PostGIS, exported a shapefile, and opened it in QGIS. There are lots of fields to join by, so there should be no errors in this.
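The tracing idea can be sketched as a graph walk keyed on surface location. The pipeline records below are invented for illustration; in the real script each (start, end) pair would come from a pipeline's surface locations in the shapefile:

```python
from collections import deque

# Hypothetical pipelines, each known only by start/end surface location.
pipelines = [
    ("10-12-065-04W4", "02-01-065-04W4"),
    ("02-01-065-04W4", "07-30-064-03W4"),
    ("07-30-064-03W4", "01-01-064-03W4"),
]

# Adjacency list keyed on surface location.
graph = {}
for start, end in pipelines:
    graph.setdefault(start, []).append(end)

def trace(well_loc, facility_loc, graph):
    """Breadth-first search from a well's surface location to the final
    facility's location; returns the pipeline segments on the path, or None
    if no route reaches the destination (in which case no volume is added)."""
    queue = deque([(well_loc, [])])
    seen = {well_loc}
    while queue:
        loc, path = queue.popleft()
        if loc == facility_loc:
            return path
        for nxt in graph.get(loc, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(loc, nxt)]))
    return None

path = trace("10-12-065-04W4", "01-01-064-03W4", graph)
print(path)  # three segments, one per pipeline on the route
```

The ambiguity mentioned above shows up here too: when several pipelines share the same surface locations, the graph has parallel routes, and a location-only join cannot tell which one actually carried the gas.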

The final thing is another python script that went through each facility to the final facility and made a diagram for it, which you can view here, along with the Y-intercept of the linear regression model, which should be the facility's break-even price in $/GJ, assuming that production is elastic.
