Whack Data

Why Does Pi Show up in the Normal Distribution?

2021-12-06T00:00:00+00:00

While recently looking through an old stats textbook, I came across the familiar equation for the normal distribution:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}} \]

Anyone who has taken a statistics course in university has come across this equation. I had seen it many times myself, but looking at it fresh this time, two questions immediately came to mind:

How exactly does this thing form a normal distribution?
What the hell is $ \pi $ doing in there?

The first question seemed simple enough to figure out: I would just have to trace back the history of the equation and put it together piece by piece. But the second question absolutely stumped me: what in the world does a bell curve have to do with a circle?

I read through all of the Math Stackexchange solutions, searched around, and asked on Twitter, but never felt like any of the answers gave me the intuition I was looking for. They relied too heavily on analytical solutions, or when visual techniques were employed, the connections felt hand-wavy to me. After doing a bit of my own research, here’s my attempt at explaining the connection without resorting to any advanced math.

First, what exactly is a bell curve?

Before we get to the $ \pi $ part, it helps to gain some insight into how exactly a bell curve is formed. Let’s start with the exponential function, which you can see within the equation above. Here it is standing on its own:

\[ f(x) = e^{x} \]

If we square the value of $ x $, it turns into something that looks kind of like a quadratic, but isn’t one. Instead, it’s a function that grows much faster than a quadratic, but has some similar properties such as being symmetric about its lowest point. Adding it to the plot above for comparison, you can see that they have the same value at $ x=0 $ and $ x=1 $:

\[ f(x) = e^{x^2} \]

Finally, let’s make the exponent negative, and like magic, we get the bell curve shown in red below:

\[ f(x) = e^{-x^2} \]

This function, $ f(x) = e^{-x^2} $, is just one particular bell curve of an infinite number of possibilities. In general, you can raise $ e $ to any quadratic you like. However, it is only when that quadratic is concave (that is, it “opens” downwards) that you get a bell curve. Above, that quadratic was $ -x^2 $, which does indeed open downwards.

For example, the equation $ f(x) = x^2 + x + 2 $ plotted in blue below is not concave, and when $ e $ is raised to it, you get the green curve, which is obviously not a bell curve:

If we switch the equation to be $ f(x) = -2x^2 + 3x + 2 $, though, we get a concave function, and $ e $ raised to that forms the bell curve shape:

For this reason, the general equation of a bell curve is $ e $ raised to a quadratic:

\[ f(x) = e^{\alpha x^2 + \beta x + \gamma} \]

To help constrain it to only concave quadratics, you can perform the following replacements:

\[ \alpha = \frac{-1}{2\sigma^{2}} \] \[ \beta = \frac{\mu}{\sigma^{2}} \] \[ \gamma = \ln(a) - \frac{\mu^2}{2\sigma^2} \]

After you substitute these in and rearrange, you’ll find that you get the following, which is almost exactly the equation we started with at the top, only with an $ a $ in front of it:

\[ f(x) = ae^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}} \textbf{ vs } f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}} \]
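For those who want to see the rearrangement spelled out, here is a short sketch of the algebra. It uses only the three replacements above and the fact that $ e^{\ln(a)} = a $:

\[ \alpha x^2 + \beta x + \gamma = -\frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2} + \ln(a) = -\frac{(x-\mu)^2}{2\sigma^2} + \ln(a) \]

\[ e^{\alpha x^2 + \beta x + \gamma} = e^{\ln(a)} \cdot e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}} = ae^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^{2}} \]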

The $ a $ is chosen in the equation on the right so that no matter what shape the bell curve takes, the area underneath it is always exactly 1. This is because for a statistical distribution, 1 is equivalent to 100% of the possible outcomes, and the area should always sum to that value.

So, in other words, the connection between the bell curve and that $ \pi $ term must have something to do with the area of the curve itself. But what exactly is that connection?

Before I get to how $ \pi $ is related, let me first state a fact and let you chew on it for a moment: if we return to one of the equations above, $ f(x) = e^{-x^2} $, it turns out that the area under this curve is exactly $ \sqrt{\pi}$.
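If you would rather check that fact than take my word for it, here is a minimal Python sketch that adds up thin rectangles under the curve. The integration range of -10 to 10 and the number of rectangles are my own choices for illustration; the curve is so close to zero beyond that range that the missing tails do not matter at this precision.

```python
import math

def bell(x):
    """The bell curve f(x) = e^(-x^2)."""
    return math.exp(-x * x)

# Assumed settings for the approximation (not from the original post).
lo, hi = -10.0, 10.0
steps = 100_000
width = (hi - lo) / steps

# Midpoint rule: one thin rectangle per step, measured at its centre.
area = sum(bell(lo + (i + 0.5) * width) for i in range(steps)) * width

print("approximate area:", area)
print("sqrt(pi):        ", math.sqrt(math.pi))
```

The two printed values should agree to many decimal places.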

Let’s take stock of what just happened there. We took a transcendental number, $ e $, and raised it to the power of a quadratic. When we calculate the area under that curve, we get the square root of another transcendental number, $ \pi $.

It turns out that these two numbers are related in a few ways, including their relationship in the complex number system via one of the most beautiful equations in math: $ e^{i\pi} + 1 = 0 $. But that equation doesn’t play a role here.

Instead, as we’ll see, $ \pi $ comes out of the way that we have to go about calculating the area. In a roundabout way, we can get this area by working with the square of the area under $ e^{-x^2} $, and then taking the square root. In other words:

\[ \sqrt{( \text{Area of } e^{-x^2})\cdot (\text{Area of } e^{-x^2})} \]

The reason we have to do this has to do with the calculus technique that we need to employ to get the area. There are plenty of examples online that show how to do this, but I want to instead provide the visual intuition that these analytic solutions don’t necessarily convey.

Since the variable we use to calculate the area is arbitrary, we can just as easily represent the above equation as the following, where we replaced the second $ x $ with a $ y $:

\[ \sqrt{( \text{Area of } e^{-x^2})\cdot (\text{Area of } e^{-y^2})} \]

You can now think of this as putting one of these bell curves on the x-axis and the other on the y-axis, and then getting all combinations of their heights and plotting it in 3 dimensions:

To get the area of one of the curves, you just need to get the volume of the “hill” that forms, and then take the square root of that value. An analogy to this with fewer dimensions is knowing the area of a square, and then getting its side length by taking the square root.

Note: The shortcut we are about to use will not work for all types of functions. If you try it with a quadratic (say, $ -x^2 + 9 $), you will not get the correct answer. The reason is that it only works for functions whose 3D surface, formed by multiplying the two copies together as above, is rotationally symmetric. While the Gaussian's is, you can see from a similar plot of the quadratic that it is "boxy" and is not symmetric through rotation the way that the surface above is:

OK, so how do we get the volume of the “hill” above? One way would be to chunk it up into squares like above, and then get the height of each in the middle of the square. You could then calculate the volume of these square pillars as $ (\text{Area of Each Square}) \cdot (\text{Height}) $ and then add up all those smaller volumes. The smaller you make the squares, the better the approximation.
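The square-pillar approach can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the hill is cut off at 5 units from the centre in each direction (where it is already essentially flat), and the function names are made up for this example.

```python
import math

def hill(x, y):
    """Height of the hill: one bell curve along x times one along y."""
    return math.exp(-x * x) * math.exp(-y * y)

def pillar_volume(side, half_width=5.0):
    """Add up (area of each square) * (height at its centre)."""
    n = int(round(2 * half_width / side))  # squares per row
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * side
        for j in range(n):
            y = -half_width + (j + 0.5) * side
            total += side * side * hill(x, y)
    return total

for side in (1.0, 0.5, 0.1):
    volume = pillar_volume(side)
    print(f"side={side}: volume={volume}, sqrt(volume)={math.sqrt(volume)}")

print("pi:", math.pi, "sqrt(pi):", math.sqrt(math.pi))
```

As the squares shrink, the volume settles toward $ \pi $ and its square root toward $ \sqrt{\pi} $, but nothing in this calculation hints at why a circle is involved.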

However, this hides where the $ \pi $ comes from. So instead, imagine that instead of using squares, we divide it up radially. In this diagram, we are looking down from the top and we see the contour lines of the hill:

Here, you divide up the hill into “slices” represented by the black dotted lines. Those slices are further divided into pieces as highlighted in blue. As above, you multiply the area of each of these blue pieces by the height of the hill at that point to get the volume.

\[ r \Delta \theta \Delta r \cdot \text{Height} \]

In this case, though, you repeat this along the “slice” to get the volume of the entire slice, and then multiply that by the total number of slices to get the entire volume of the hill.

If you make the angle $ \theta $ small enough so that it’s barely a sliver, then for all intents and purposes, you can multiply the volume of a slice by $ 2 \pi \text{ radians}$, the number of radians in a circle.

If you actually do this math (again, the calculus is covered here for those that want to see it in action), you’ll find that each slice has an area of exactly $ \frac{1}{2} $. Multiply that by $ 2 \pi \text{ radians}$ and you get a volume that exactly equals $ \pi $.
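Here is a rough numerical version of the radial approach, as a sketch rather than a proof. It relies on the fact that at a distance $ r $ from the centre the height of the hill is $ e^{-x^2} \cdot e^{-y^2} = e^{-r^2} $. The step size and the cut-off radius of 6 are assumptions I picked for illustration.

```python
import math

def slice_per_radian(dr=0.001, max_r=6.0):
    """Add up r * dr * height along one thin slice, per radian of angle."""
    n = int(round(max_r / dr))
    total = 0.0
    for i in range(n):
        r = (i + 0.5) * dr            # middle of this small piece
        total += r * dr * math.exp(-r * r)
    return total

one_slice = slice_per_radian()
volume = one_slice * 2 * math.pi      # stitch the slices around the full circle
area = math.sqrt(volume)              # area under a single bell curve

print("one slice (per radian):", one_slice)
print("volume of the hill:    ", volume)
print("area under e^(-x^2):   ", area)
```

The slice comes out very close to $ \frac{1}{2} $, and the only place $ \pi $ enters is the final multiplication by the number of radians in a circle.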

So there you have it: $ \pi $ comes out of the fact that we find the volume by making radial slices, and then stitching them all together around a circle.

As it turns out, anything that is symmetric through rotation can be thought of as involving circles, and naturally, circles imply that $ \pi $ is lurking somewhere in the math.

While this isn’t a rigorous proof and I skipped over a lot of details (e.g. slicing the 3D plot radially only works because the hill formed by the two bell curves is rotationally symmetric, which is not true for most functions) I hope that this gives readers an intuition for why $ \pi $ seems to show up out of nowhere in a curve that has seemingly little to do with it.

Converting Geospatial Files Using ogr2ogr in Scala

2017-11-25T00:00:00+00:00

Converting geospatial files from one format to another is one of the most common tasks when starting a new project. This process has been made significantly easier by GDAL’s ogr2ogr, a command line tool for doing exactly that. Using it from within your Scala code isn’t hard, but it can be tricky. I’ll show you how below.

Note

If you'd like to see the full working example ahead of time, see: https://github.com/Brideau/scalaogr2ogr

GDAL Setup

First, since ogr2ogr depends on GDAL, you’ll need the GDAL Java Bindings set up before you get started. See my previous post, Find a Geospatial File’s SRID Using Scala and GDAL, for some guidance on that.

Project Setup

Add the following to your project’s build.sbt file to get the Boundless Geo resolver and to load the GDAL library (I’m running Scala 2.12.4 and SBT 1.0.3 for the record):

Download this sample file and save it in your project under the /src/main/resources folder, creating it if it doesn’t already exist.

Merging the ogr2ogr Java Port with Scala

Next, you have to track down the version of ogr2ogr that has been ported to Java. This is buried inside OSGeo’s Github repo under the SWIG bindings folder, which you can find here. It doesn’t have the prettiest API as you’ll see (even the author admits so in the source code comments), but it does the trick.

Create a new folder in your project src/main/java to store this file, and put it in a package org.gdal.apps to keep things organized. Finally, rename the main class to execute as shown below (this should be on line 99 or so of the file) to keep it from confusing the JVM later.

Calling ogr2ogr

We’ll be using Futures to keep things nice and asynchronous, so you’ll need to import them and a few other things at the top of your class:

Next, add this function that will be used to create a folder to hold your output and to call ogr2ogr to perform the conversion:

Finally, build up your ogr2ogr command using their documentation and the output driver’s specific documentation, just as you would if running it from the command line. Store it as an array of strings. For example, to output as a Shapefile, just Google ogr2ogr driver shapefile to find this page.

Now you can build and run it, and it should create and store a shapefile in a new temp directory within the folder you ran it from. You may see a large number of warnings about the fields being too wide using the example file, but the conversion should still complete successfully.

Passing Additional Parameters

Some drivers, such as the CSV Driver, take a number of additional parameters to configure the output format. These can be added as shown here:

Find a Geospatial File's SRID Using Scala and GDAL

2017-11-23T00:00:00+00:00

On occasion, you might find yourself working with a new spatial data set where it would be really handy to figure out what the current SRID is. Luckily, the smart people at GDAL.org have done the hard work for you. Here’s how you do it in Scala using the GDAL Java Bindings.

Note

If you'd like to see the full working example ahead of time, see: https://github.com/Brideau/findsrid

GDAL Setup

First, you have to set up GDAL with the Java bindings on your machine. It’s not as simple as just having GDAL installed. There are instructions to do this here for various systems, and here for mac, but in my experience these aren’t very up to date and you may have a date with StackOverflow to get through this part. The following worked for me on macOS High Sierra:

First, disable System Integrity Protection. The rest of this won’t work if you don’t do that.
Run:
Follow the instructions that show in your terminal in the output of the previous command to update your .zshrc or .bashrc or whatever file it is your terminal uses.
Add the following to the same file as in the previous step:
Restart any terminals you have open or run source [the file from above] in your terminal

Project Setup

Add the following to your project’s build.sbt file to get the Boundless Geo resolver and to load the GDAL library (I’m running Scala 2.12.4 and SBT 1.0.3 for the record):

Download this sample file and save it in your project under the /src/main/resources folder, creating it if it doesn’t already exist.

Blocking Approach

The simplest way to do this is to write it without caring about whether your code blocks. To do this, add the contents of the Main class below to whichever class you’re working with, following along with the comments:

That’s it! Build, run, and enjoy. If it’s possible to identify your SRID from the file you have, there’s a good chance the AutoIdentifyEPSG method will do it.

Non-Blocking Approach

To achieve the same thing without blocking, wrap everything in a function that returns a Future when it is called. Then, call the function, and use pattern matching to choose what to do depending on whether the function returns successfully or not. Make sure you import the global ExecutionContext at the top and that you block the main thread at the end to prevent the JVM from terminating, as shown below.

This approach has the benefit of being able to complete other work while the file is being loaded, which is handy when you’re dealing with large geospatial files.

How Much Income Tax Canadians Pay: A Provincial Comparison

2016-04-27T00:00:00+00:00


Who pays the least income tax in Canada, and how has that changed over time?

This simple question turns out not to have an easy answer. The combination of hidden income taxes, some provinces’ reluctance to change their brackets with inflation, and “curvy flat taxes” results in an income tax horse race. Where you’d pay the least amount is completely dependent on how much you earn.

Canadian Provincial + Federal Effective Tax Ranking, 2016

The calculations for this include federal and provincial income tax, Employment Insurance, the Canadian & Québec Pension Plans, BC’s Medical Services Plan, Ontario’s Health Premium, Québec’s Health Services Fund, surtaxes (that is, a tax on how much tax you pay), Québec’s Parental Insurance Plan, the basic exemption, as well as special adjustments to the Federal tax rate for Québec (they pay 83.5% of the federal income tax that other provinces and territories pay). All dollars have been inflation adjusted to 2016 dollars using the same formula used by the federal government to adjust their tax brackets.

To keep things simple, I visualized what these tax rates look like for a single person with no dependents, earning their income through the work they do as opposed to interest on investments. This, of course, is an approximation, but it’s good enough for comparison purposes.

When you combine all of these things together, you get the ‘effective tax rate’ at every income for each province. All those taxes combined form a unique curve that represents how much income tax a resident of each province pays. Below, you can see how these curves have changed since 2005, and since last year. In general, there have been decreases for people earning up to about $200,000 in most provinces, and increases for those earning more than that, with some exceptions.

Comparing the Effective Tax Rates Over Time

Note: This chart does not include the changes to Newfoundland and Labrador's income tax announced in mid-April 2016 that come into effect mid-year.

Instead of raising the income tax, many provinces create new taxes such as health fees, parental insurance fees and surtaxes. The result is a lumpy tax curve - especially noticeable in Ontario - that is indicative of a much more complicated system. Note that Québec is an outlier in the chart below because they pay a reduced federal rate, giving them the ability to collect more tax as a province.

Hidden Income Taxes

In addition, taking this approach gives the appearance of a province having lower income tax, when in reality it is spread over multiple types of tax.

Taxes can also be increased without passing any legislation. Between 2005 and 2015, Prince Edward Island introduced an almost 1% increase in personal income tax for some earners by not changing their tax brackets with inflation.

For example, while the lowest tax bracket in New Brunswick was $32,730 in 2005, it was $39,973 in 2015 because it was adjusted along with inflation each year. PEI, however, almost never changes their brackets, with the lowest bracket only changing from $30,754 in 2005 to $31,984 in 2015 - well below the rate of inflation.
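To put rough numbers on that difference, here is a small Python sketch using only the four bracket thresholds quoted above. The "if indexed" figure is a hypothetical of my own: it shows where PEI's threshold would sit if it had grown at the same pace as New Brunswick's, and is not an official number.

```python
# Lowest-bracket thresholds quoted in the text (nominal dollars).
nb_2005, nb_2015 = 32_730, 39_973
pei_2005, pei_2015 = 30_754, 31_984

def growth(start, end):
    """Fractional growth of a threshold between two years."""
    return (end - start) / start

nb_growth = growth(nb_2005, nb_2015)
pei_growth = growth(pei_2005, pei_2015)

# Hypothetical: PEI's 2015 threshold had it kept pace with NB's.
pei_if_indexed = pei_2005 * (1 + nb_growth)

print(f"NB threshold growth:  {nb_growth:.1%}")
print(f"PEI threshold growth: {pei_growth:.1%}")
print(f"PEI threshold if it had tracked NB: ${pei_if_indexed:,.0f}")
print(f"Actual PEI threshold in 2015:       ${pei_2015:,.0f}")
```

Any income between the actual threshold and the hypothetical indexed one is taxed at the next bracket's rate instead of the lowest one, which is where the quiet increase comes from.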

The result of this can be seen in this next chart, where the orange lines represent NB’s effective tax rates in 2005 and 2014 - which almost perfectly overlap - while PEI’s show a large gap between the two years. This tax increase will continue to grow each year unless this policy changes.

How to Increase Taxes Without Increasing Them: Bracket Creep

Sometimes, taxes don’t behave the way you might think. For example, until 2015, Alberta had a single income tax rate for all people: 10%. Intuitively, you may think this would result in a horizontal line where everyone pays the same rate. In reality, it’s as curved as any other!

The reason is the basic personal amount: the first part of your income on which you don’t pay tax. In Alberta, this amount is $18,451 for 2016, so if you earn $20,000, you only pay tax on $1,549. This means you pay $155 tax on your $20,000 income, or about 0.8% - far from the 10% rate.
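The same arithmetic can be written as a small Python function. This is a simplified sketch that only models the flat provincial rate and the basic personal amount from the paragraph above; it deliberately ignores federal tax and every other deduction, and the function name is made up for this example.

```python
def flat_tax_effective_rate(income, basic_amount=18_451, rate=0.10):
    """Effective rate under a flat tax with a tax-free basic amount."""
    taxable = max(0.0, income - basic_amount)  # nothing below the basic amount is taxed
    tax = taxable * rate
    return tax / income if income > 0 else 0.0

for income in (15_000, 20_000, 50_000, 100_000, 500_000):
    print(f"${income:>7,}: {flat_tax_effective_rate(income):.2%}")
```

The effective rate starts at zero, climbs quickly, and then creeps toward 10% without ever reaching it, which is the curve a "flat" tax actually produces.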

Why A Flat Tax Isn't Flat

Note: The gap between the lines is not a plotting error. Alberta adjusted the amount of income people do not need to pay tax on by more than inflation. As a result, everyone's rate declined by about 0.1%.

The income tax system is complicated. While the use of hidden taxes and inflation to raise taxes may be politically beneficial, it makes the system even harder to understand. My hope is that this piece helps to clarify the way that the income tax system works, and makes it easier for the average citizen to make sense of what can be an overwhelming topic.
Ryan Brideau

Hey, I’m Ryan Brideau. I work as a Senior Data Scientist at Wealthsimple. Previously, I was at Shopify. You can follow me on Twitter here: @Brideau

Some Notes on Data Sources and Methodology

Tax bracket and credit data are not stored in a single document that shows their changes over time; they are stored in 30 years of PDFs across two government websites (1) (2). To produce these charts, I downloaded the federal, provincial and territorial tax-return forms for the last 10 years (code here). I then compiled that tax bracket and tax credit information into a giant JSON object, cross-referencing with documents from KPMG, Ernst & Young, and the Canadian Taxpayers Federation to validate the data. Once this information was recorded, I wrote a program (code here) that calculated what percentage tax was paid by every earner from $0 to $500,000 income.

The federal government's method for adjusting for CPI is described here, though their definition isn't quite correct. Instead of multiplying the previous year's value by the "A divided by B" they mention, it is simply the value of A divided by B. The source of the CPI values is here.

Line Graphs and Parallel Processing in R

2014-08-04T00:00:00+00:00

A few months ago I came across one of the most beautiful visualizations I have ever seen: James Cheshire’s Population Lines Print. What I love about it is that, in the absence of any traditional map features, the outlines of countries and continents are immediately apparent. And as long as you are familiar with what the land masses of the globe look like, you know exactly what the plot is without even needing to be told. Another interesting feature is that the peaks also give information about both the population and the density: the area under the graph represents the total population, while the higher the peak, the more dense it is. (Hence the huge peak of Tokyo, and the low, wide peak of Mexico City.)

When I first saw the post, I went to his blog to see how he pulled it off. Unfortunately, there weren’t any details. So, I had to figure it out myself.

Well, I think I got it! I’ll show you a few of the plots first, and then describe the process below.

Using the same NASA SEDAC data that James used, here is the plot, produced entirely using R (click for higher resolution):

And, for my fellow New Brunswickers, here’s the same dataset of NB properties I plotted a while ago, plotted again using a line graph:

And finally, here is the property value data from my last post for Fredericton. Note that, as mentioned above, unless you’re familiar with the geography of the area, it may not be directly apparent what you’re looking at here. For those that are, you can immediately spot the river flowing through the middle of town with the tributaries flowing in:

These maps aren’t the most practical ways of plotting data into a third dimension since you can’t really ‘dive deeper’ into them to get more detail, but they are certainly one of the most beautiful I’ve seen.

So, how was it done?

EDIT: I’ve updated the below instructions since this was originally published to provide an alternative to using QGIS (thanks @Zecca_Lehn for pointing out I could do this in R itself). I also improved the performance about 3x by replacing my data frames with data tables.

Well, like everything else I do on here, I wanted to accomplish this using only freely available software. In this case, I used a combination of R, QGIS and LibreOffice (the latter two being optional). The work was divided into three main pieces:

Extract the data from NASA’s data set.
Crunch the data to prepare it for plotting.
Actually do the plotting.

All the code for processing and plotting is available here: GeospatialLineGraphs.

Extracting the Data

James originally used NASA’s population density data from the year 2000. Though the outcome should be the same, I decided to use the raw population data available here. On that page, go to download, select the format as “.ascii”, the resolution as 1/2°, and the year as 2000.

Once you’ve downloaded it, you can do either of the following to get the data into a workable form:

Using R

Put the file glp00ag30.asc into a folder called DataSets in your working directory
Run this part of the R script which uses the ‘raster’ package to turn it into a set of points. The rest of the script is described in the next section.

Using QGIS

Import the data into QGIS
Go to Raster > Conversion > Polygonize (Raster to Vector), and check the “Add to map” box
Select that layer, and go to Vector > Geometry Tools > Polygon Centroids and follow the menu items, making sure to check “Add result to canvas”. This will generate another layer with point data from your polygon data.
Right click the points layer, and save it as a CSV file. It should automatically export the X & Y coordinates.
Finally, load the data with this part of the script and make sure to comment out the earlier part that uses the ‘raster’ function.

There, your data is ready to be crunched.

Processing the Data to be Plotted

Next, you have to take this data and turn it into something that can be plotted. The code to do this is available here: Geospatial Line Graphs - Generate Data.

The basic idea behind what this does is this:

First, find the boundaries of the plot to know the start and end latitude and longitudes you want to plot over.
Using the start and end latitudes, create 200 evenly spaced points of latitude - one for each line that will be plotted.
For each point of latitude, draw a square with sides equal to the gaps between the points of latitude. Calculate the total of whatever data you’re interested in inside that square.
Move the square over an amount of longitude equal to the width of the square, and repeat the above calculation.
Do this until you’ve calculated everything, and output it all to a CSV file.
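The steps above can be sketched in Python as follows. To be clear, this is not the R script linked above: it is a simplified illustration that drops each point into its square in a single pass instead of moving a square across the map, and the function name and the `(longitude, latitude, value)` input format are assumptions made for this example. Writing the result to a CSV file is left out.

```python
def grid_totals(points, n_lines=200):
    """Sum values into square cells, one row of cells per line of latitude.

    points: list of (longitude, latitude, value) tuples.
    Returns a list of rows ordered north to south; each row is a list of
    cell totals ordered west to east.
    """
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    min_lon, max_lon = min(lons), max(lons)
    min_lat, max_lat = min(lats), max(lats)

    # Side of each square = gap between neighbouring lines of latitude.
    side = (max_lat - min_lat) / (n_lines - 1)
    if side == 0:
        raise ValueError("points must span more than one latitude")
    n_cols = int((max_lon - min_lon) / side) + 1

    rows = [[0.0] * n_cols for _ in range(n_lines)]
    for lon, lat, value in points:
        row = min(int((max_lat - lat) / side), n_lines - 1)
        col = min(int((lon - min_lon) / side), n_cols - 1)
        rows[row][col] += value
    return rows

# Tiny made-up data set, just to show the shape of the output.
sample = [(0, 0, 1), (1, 1, 2), (1, 0, 3), (0.9, 0.1, 4)]
for row in grid_totals(sample, n_lines=3):
    print(row)
```

Each row of the output corresponds to one line in the final plot, and each number in it sets how high that line rises at that point.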

Now, when you’re dealing with 500,000+ pieces of data, this can take a while if you’re not taking advantage of your hardware’s capabilities. In the case of the NB data, it originally took 2 hours to crunch on my 2.3 GHz i7, which has 4 cores (8 virtual), but only one was being used.

Luckily, R now has the ability to do parallel processing to take full advantage of multi-core CPUs. Marcus over at R-bloggers has a great tutorial on how to do this in R if you want to have a look. Furthermore, using data tables instead of data frames can also improve performance by, in my experience, about a factor of 3 for this computation. Using both of these together got the time to process from 2 hours to about 2 minutes for both the NB and world population data sets.

Plotting the Data

To plot the data, I wrote this script which does the following:

Load the data you want to plot, and pad the top and bottom to give it a border (padding the left and right is most easily done in a spreadsheet by adding rows of 0’s to the top and bottom).
Find the maximum value in all of the data, and select a factor that will be used to scale the line heights to fit the plot (this is done through trial and error).
Starting at the top, grab a row’s data, smooth it with a spline, and plot a white polygon at the appropriate height.
Outline the top of the polygon with a gray line.
Plot black line segments over the gray line if the values are over a certain threshold.
Repeat for the rest of the rows, moving down the plot area as you go so that the lower polygons overlap the upper ones.

That’s It!

There you have it - that’s all there is to these beautiful line plots. If you love the world map, go buy it from James here to support the work he does: Population Lines. And as always, if you have any questions, ask away below or send me an e-mail.

Harvesting Our Cities' Land for Dollars

2014-07-24T00:00:00+00:00

A few weeks ago my partner Gracen (of Another Place for Me fame) finished editing a video for this cool guy named Joe Minicozzi on “How We Measure the City”. Knowing my weird obsession with data, we sat down to watch it together, and one simple idea of his blew my mind: why don’t we treat our cities like farms? That is, when we consider how to use the scarce land we have in city limits for different things, we often measure how much a certain building will pay us back in tax revenue. But, we almost never consider how much we are earning per acre. And that’s a big mistake.

Naturally, I wanted to see what Fredericton’s profile would look like. And in particular, I wanted to see if I could create one of the maps that Joe’s team created using only open source and freely available tools, and do it in such a way that it could be embedded online and easily accessed.

For land size data, I used GeoNB’s Digital Property Maps of the whole province. And for the property tax value, Shawn Peterson (@SaintJohnShawn), creator of Propertize.ca, was nice enough to provide me with 4 years’ worth of data for every property in Fredericton.

The main tools I used were:

R for getting the data into something I could work with and doing all the math (code here),
QGIS for taking that data and mapping it to the land data (tutorial),
TileMill for styling the map, and
MapBox for hosting it.

At first, I tried using CartoDB as I did in my last post, but found the hosted version was quickly overwhelmed by the amount of data I was working with, and installing it locally turned out to be a nightmare. Even in the small tests I did it didn’t seem like the ideal tool anyhow. TileMill turned out to be a nice - if buggy - tool, as you’ll see below.

So let’s start with the first map. I wanted this one to accomplish a couple things:

It should clearly show the relative tax revenue per hectare of each property in the city (sorry Americans, I didn’t use the acre, but you can just divide by 2.5), and
It should allow somebody to zoom into any part to look at sections in more detail.

So what I did was take each property, calculate its tax levy per hectare, and then compare each of those to the median tax levy per hectare. What this gives me is the ability to take any piece of land, and see how much above/below average it is in terms of how much money it is generating for the city. I then grouped them into 10 buckets based on that difference where each bucket has the same number of properties in it, and designed a separate pop-up that will display the specific property statistics when you hover over it.

That was a lot to take in, so let me show you an example:

Let's say somebody has 0.2 ha of land, and pays $5,000/year in tax. Per hectare, that land is generating $25,000 in tax revenue for the city. The median tax per ha, however, is $34,000. This person would then be earning ($25,000 - $34,000) / $34,000 = -26.5%. That is, 26.5% below the median.
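Here is how that calculation and the bucketing could look in Python. It is a minimal sketch and not the R code linked above: the function names are invented for this example, and the bucketing simply ranks the properties and cuts the ranking into 10 equal-sized groups.

```python
def levy_per_ha(tax, hectares):
    """Annual tax generated per hectare of land."""
    return tax / hectares

def pct_vs_median(value, median_value):
    """Percent above (+) or below (-) the median."""
    return (value - median_value) / median_value * 100

def decile_buckets(values):
    """Give each value a bucket from 0 (lowest) to 9 (highest),
    with (as near as possible) the same number of values in each."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = rank * 10 // len(values)
    return buckets

# The worked example from the text: 0.2 ha paying $5,000/year,
# compared against a median of $34,000 per hectare.
per_ha = levy_per_ha(5_000, 0.2)
print(f"${per_ha:,.0f} per ha, {pct_vs_median(per_ha, 34_000):.1f}% vs median")
```

Feeding every property's percent difference into `decile_buckets` gives the 10 colour groups used on the map.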

Now, I mentioned that TileMill/MapBox were a bit buggy, and you’ll notice that the below map is missing a legend. Well, your guess is as good as mine why this shows up ok on my computer, but breaks entirely when I put it into MapBox. I’ve spent hours trying to sort that one out. Anywho, here’s what it should be:

Percent More / Less Tax Revenue Generated per Hectare Compared to Median

Median Tax/ha: $34,018

Joe talks a lot about comparing malls to other styles of higher-density development, but what really stands out to me in this map on a macro-level is the clear difference between the old-style suburbs downtown between Brunswick and Beaverbrook (regularly generating well over $100,000 / ha):

The newer suburbs on the hill (most hovering sub-$50,000):

And worse still, the newest-style suburbs near Canterbury Drive (where they rarely break $35,000). That’s a massive difference in how we’re ‘harvesting’ our land.

This was interesting (and there’s a ton more I could have gone into here), but I also wanted to see if I could take it a step further. Since I had data from 2011-2014, I wanted to see if I could spot any clusters of neighbourhoods that were increasing or decreasing in value. These processes often take well over 4 years, but it was worth a try.

Taking the same approach as above, I calculated the change in tax revenue per hectare between 2011 and 2014, took the median amount, and compared all the values to that amount. Once again, I also grouped them into 10 buckets. Note that the number below isn’t the % change in tax for a property, but how much a given property changed compared to how all the other properties changed on average. So, if you pay $2000/ha more this year than you did 4 years ago, but everyone else does as well, your increase would be 0% more than the average.

Percent Difference Tax Change per Hectare Compared to Median

Median Change in Tax/ha: $1837

And here you really see the contrast between the decreasing property values in the student neighbourhood next to campus and everywhere else:

There’s almost too much data in this for a single person to analyze in one shot, so I may be chipping away at this over a few blog posts as I get better at working with TileMill. (Gracen has also offered to provide her highly-trained-urban-designer-brain for a look at it as well, so stay tuned to her blog for that.) I would also love it if people took a look at their own neighbourhoods and gave me some insight into why certain parts are the way they are!

Some next steps I’m considering for presenting this better would be:

Look for a way to do 3D plots in TileMill, as size is much easier to compare than colour.
Do plots of only the outliers to see who is really knocking it out of the park in harvesting their land / where are the biggest drains coming from.
Really dig into the details of the data instead of keeping the macro perspective.

Spot the Suburbs

2014-06-12T00:00:00+00:00

Edit: Big thanks to Chad Skelton (Vancouver Sun) and William Wolfe-Wylie (Canada.com) for picking this up and running with it! See their articles here:

Vancouver Sun
Canada.com

I just started playing with CartoDB today, and I can already tell how much time it is going to save me when doing any kind of map plotting - this thing is incredibly powerful.

Just to test things out, I uploaded all of the building locations in Fredericton (which unfortunately was too much for my tiny trial account, and I had to split it) and colour coded them by their street designation.

Nothing spectacular here, but it’s pretty interesting how quickly you can spot the suburbs with their many “Crescents” and “Courts”:

UNB Strike - It's a Revenue Problem, Not a Fairness Problem

2014-01-16T00:00:00+00:00

For parts 1 and 2, see here:

By the Numbers: UNB Operating Grant

Professor’s salaries, and how they compare to their president’s salaries

I was stunned by the reception of my last post. I honestly thought that nobody except a few data-minded people like myself would care to read it, but the response and feedback were incredible. I had planned to do my third instalment on some of the non-monetary issues being discussed, such as the decline in the number of professors on track to getting tenure, but I’ll push that ahead to #4 and take time to address a few questions and go a little deeper into the last two topics. If you haven’t already, I recommend reading the last two posts or else this one won’t make much sense.

The big question people had after my last post was, “well, it’s interesting that UNB comes out in the middle of the pack of the president’s salary index, but what exactly does that mean? Is this a good thing for the administration, or AUNBT?” I’m cautious about drawing a strong conclusion, but I’ll tell you how I see it, and you can tell me if you agree. But first, to jog your memory, this is the graph I’m talking about:

(Source: University and College Academic Staff System data from 2000-2010 & CAUBO: Financial Information of Universities and Colleges)

So what’s going on here? Well, when you look at the median salary for a group of employees and how it compares to the top salary of that company, it gives a very good indication of whether the top-management has ‘lost touch’ so-to-speak. If the percentage is incredibly low, either the president is getting paid way too much, or the employees are getting paid way too little. This is a common metric used by all sides of the political spectrum to measure fairness of pay.

What UNB being in the middle of the graph above implies is that, as far as similar Canadian universities go, UNB’s pay as a whole _is_ fair. That is, given the money allocated for salaries for both professors _and_ administration, it’s evenly handed out between the two compared to all similar universities. So, if you accept the hard fact that UNB’s professors are among the lowest paid among their peers (they really are), and can stomach the fact that UNB’s administration is paid an amount that is fair in comparison (even if it seems high compared to what you or I earn), then the next part is easy: they are all paid too little.

To put it differently, if UNB wants to build a true, national university, and as part of attracting talent they need to offer competitive salaries, then they need to offer competitive salaries for all positions and ranks, and increase everyone’s salary. It’s already distributed fairly - there’s just not enough to distribute. In other words, we need more money. Period.

If you recall from my first post, I had a graph that showed the portion of the operating budget that was covered by tuition and by provincial funding:

(Source: CAUBO: Financial Information of Universities and Colleges)

Looking at this, it’s obvious that the biggest lever that can be pulled to generate revenue is the one for government funding. The common rebuttal to this is that we are going through ‘tough economic times.’ Fair enough. But it also doesn’t help that we’re a province whose population is the size of downtown Ottawa - nearly half of which is functionally illiterate (seriously - check StatsCan. It’s embarrassing.) - and we’re trying to support 4 universities and a number of colleges. Something has to give, and we’re witnessing that now. The university funding problem is the canary in the coal mine for much deeper, systemic issues that the province is facing. But that’s a much bigger topic than I can address here.

The next largest lever is student revenue. I don’t mean tuition, I mean tuition multiplied by the number of students. Tuition is already high here, and a ton of other people have run the numbers to demonstrate that. Instead, we have a recruitment problem: we need more students at UNB.

The third lever is related to something I glossed over in the first post, but have since circled back to. If you add the two numbers for any year in the chart above, there’s a small gap labelled ‘other’ that makes up the remaining 7% or so. It turns out there is more to that 7% than I initially thought. This part is mostly covered by revenue from a university’s endowment fund, other investments, and donations that are allowed to be spent on operating expenses. (For those not familiar, the endowment fund is a pile of cash that universities were given over time and have invested, where they can only spend the interest.) As it turns out, it isn’t always this small for universities across Canada.

If you go to page 21 of this report by UNB, you’ll see that they are quite open about how much of a shortfall they have in this fund compared to other universities. But as always, these numbers don’t tell the whole story, because not all endowment funds are created equal. Some, for example, have a large percentage that can only be spent on things that the person who donated the money wanted it spent on. The true measure, then, is how many dollars are being generated by donated funds that can actually be used for things like salaries. Luckily, CAUBO has those numbers.

Instead of comparing totals of endowments or investments, for this analysis, I’ve looked at the portions of those that are allowed to be used in the operating budget. So, if someone donates $5M, but wants an auditorium named after them instead of letting it pay for professors or operating costs, it doesn’t count. Specifically, I looked at three data points in the CAUBO data, and combined them into one dollar figure: individual donations, endowment fund revenue and other investments. I then divided each by the number of full-time equivalent students at each university for the given year (with the exception of 2009, where CAUBO dropped the ball on collecting student numbers):
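As a rough illustration of that per-student figure, here is a small Python sketch. The function name and the dollar amounts are hypothetical, and the real CAUBO categories may be defined more finely than the three arguments used here.

```python
def spendable_revenue_per_student(donations, endowment_income,
                                  other_investment_income, fte_students):
    """Combine the three operating-eligible revenue sources into dollars per FTE student."""
    if fte_students <= 0:
        raise ValueError("fte_students must be positive")
    total = donations + endowment_income + other_investment_income
    return total / fte_students


# Hypothetical figures for one university in one year
typical_year = spendable_revenue_per_student(
    donations=1_000_000,
    endowment_income=2_500_000,
    other_investment_income=500_000,
    fte_students=8_000,
)

# Investment income can be negative in a bad year, which drags the total below zero
bad_year = spendable_revenue_per_student(
    donations=1_000_000,
    endowment_income=-3_000_000,
    other_investment_income=0,
    fte_students=8_000,
)
```

Dividing by full-time equivalent students is what makes a small school and a large school comparable on the same axis.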

(Source: CAUBO: Financial Information of Universities and Colleges)

Note the massive hit that all universities took in 2008, but especially Queen’s and Simon Fraser where they actually lost money. This graph is quite sporadic because of the recession, but I felt it was important to show this to provide context. Below, I’ve taken a look at the newest data, 2011/2012, and added colour to show the proportion of the total operating budget each makes up.

(Source: CAUBO: Financial Information of Universities and Colleges)

There are a couple of big take-aways I got from this. First, UNB has some serious catching up to do. Even universities like Dalhousie, which suffer from the same demographic and population issues that we do, are knocking it out of the park. For every dollar we generate from donations and our endowment to pay for salaries and operating expenses, they have $8.50. The second is that, even at the high-end for schools like Queen’s, this source of revenue only makes up 6% of the operating budget. It’s a lever, but at the end of the day, it’s a small one.

Putting this all together, it goes something like this: both the AUNBT and UNB are correct. Professors are paid too little, but so is the rest of the staff at UNB for an institution that has the ambition it does. We should pay everyone more, but we don’t have the revenue. We don’t have the revenue largely because our province can’t afford to pay what it needs to pay to support a national university, leaving us with only two levers under our control: student numbers and donations. And most of the donations we are getting aren’t allowed to be spent on operating expenses.

In short, we’re in quite the bind.

–

Ryan Brideau

Disclaimer: I know members of both the UNB administration and faculty, and consider many to be friends. I am not being compensated monetarily or in any other way to produce this analysis by either side. All of the scripts I’ve used to create these are available for scrutiny on my GitHub page (for those that understand R, that is - sorry, the data was too complicated for Excel): https://github.com/Brideau/UNBStrikeWatch

UNB Strike - Professor's Salaries, and How They Compare to Their President's Salaries.

2014-01-15T00:00:00+00:00

As I mentioned yesterday, I think there is a lot of value in having a 3rd party do some serious fact-checking on the numbers being released by both sides of the strike debate. Since salaries are the elephant in the room, that’s where I’ll be putting my focus for this article.

To provide some background, the AUNBT believes that professor salaries should be equal to the average of similar schools, and that those in especially bad shape are the low-wage entry-level professors. The UNB administration believes that the better metric to compare salaries to is what percentage of their operating budget salaries take up. With that in mind, I hope to accomplish three things with this article:

To explain why comparing salaries to the operating budget doesn’t really make sense,
To fact-check AUNBT’s numbers, and
To provide an alternative ratio that can be used across universities

Why comparing salaries to the operating budget doesn’t really make sense

First, this has no effect on whether a professor decides to come to UNB. I’ve never heard of a job offer going along the lines of “Your compensation will be $80,000, which represents 0.026% of our operating budget. We think you’ll be quite pleased with that, as other universities only offer 0.023%.” That just doesn’t happen, but using this kind of statistic implies that it matters. It simply doesn’t.

Second, and more fundamentally, what schools spend their operating budget money on is very different between schools. UNB, for instance, has an enormous amount of infrastructure given its student population, and some of it is very old. The maintenance and heating costs for these buildings take a lot of money, and that will distort the operating budget in a way that would make it hard to compare to a school with a similar number of students but far fewer or newer buildings.

Now, on to the numbers…

The AUNBT has published this bulletin arguing for why they want a significant increase in their salary. They state that the average of 14 comparable schools is significantly higher than what they are currently earning, so it is only fair to increase it to that number to attract top talent. There are issues with this, however, like (as a friend of mine put it) “Queen’s profs get paid more because they are Queen’s profs” and the simple fact that it costs more to live in many other places with universities than it does to live in Fredericton.

Even with all this in mind, I decided to check the numbers produced by AUNBT so that others could have the data they need to verify what they’ve published. Using the University and College Academic Staff System data from 2000-2010 (sorry there’s nothing newer, but they killed the survey with the Stats Canada cuts), I decided to look at the same comparison schools and include the trend over years as well. I also converted everything to 2002-equivalent dollars so we can see how their salaries change, regardless of inflation. I looked at 3 different levels of professors, in order of rank: full professor, associate professor, and assistant professor. Each of these includes professors with administrative roles, and excludes salaries for people in medical/dental fields since not all schools have those, and they would distort the numbers.

Starting at the high end, I took a look at what some of the highest earners at each school made. For these, I looked at full professors in the 90th percentile income bracket (that is, if there were 100 profs, 90 would earn this amount or less) with UNB as the dotted line for emphasis:

(Source: University and College Academic Staff System data from 2000-2010.)

Remember that these are real dollars, so when the line goes up and to the right, it means that salaries are increasing above the rate of inflation. It’s worth mentioning that it’s clear from this that UNB has, for the last decade, paid lower wages than most other similar schools at the top-end of the professor market. (That’s not a value judgement, it’s simply a fact, and doesn’t take into account the things I mentioned earlier.)

Next, I did full professors’ median salary, keeping the graph dimensions the same to make it easier to compare, and got similar results:

(Source: University and College Academic Staff System data from 2000-2010)

The numbers that I have are 1-year behind what is used by AUNBT, but it should be pointed out that UNB has only recently fallen to last - a fact hidden by only showing 2010-2011 figures as they did in their report.

And then associate professors, 90th percentile, again similar:

(Source: University and College Academic Staff System data from 2000-2010)

And associate, median (similar):

(Source: University and College Academic Staff System data from 2000-2010)

Then assistant, median (once again…similar):

(Source: University and College Academic Staff System data from 2000-2010)

And finally, to show how much some of our worst-paid professors get, I did the 10th percentile for assistant professors:

(Source: University and College Academic Staff System data from 2000-2010)

This one is actually different from the others! We can clearly see that in this case, we used to pay our worst-paid professors around the average of the worst-paid bracket, but have since fallen behind.

To give you a feel for how these different salaries compare, I’ve taken the liberty of animating them as well:

But does this even matter?

And is there a better way to compare salaries that captures the essence of ‘fairness’, gives a way to compare schools in completely different geographies that have different costs of living, and also takes into account the simple fact that some schools are more prestigious than others? I think so, and it involves the also-controversial figure of university presidents’ salaries.

Since president salaries are also pegged to school size, prestige, and location - the same factors that affect professor salaries - a ratio of each university’s president’s salary to their professors’ salaries would cancel these factors out. And if you argue that presidents are paid too much, you can at least accept that they are all similarly paid too much. This approach is widely used as a measure of how ‘out of touch’ the salaries of top earners at a business or institution are compared to their employees, and so I’m comfortable using it here.
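A tiny Python sketch shows why the ratio cancels out those shared factors. The function name and salary figures are made up for illustration and are not taken from the data used here.

```python
def salary_index(median_professor_salary, president_salary):
    """Median professor salary as a percentage of the president's salary."""
    return median_professor_salary / president_salary * 100


# Two hypothetical schools: the second pays everyone 25% more
# (say, because of prestige or cost of living), but the ratio is unchanged.
smaller_school = salary_index(100_000, 400_000)
pricier_school = salary_index(125_000, 500_000)
```

Both schools land on the same index value even though their absolute salaries differ, so any remaining gap between schools reflects how pay is split rather than how much there is to split.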

Using the president salary and bonus statistics provided by Maclean’s (who got them from this report), adjusting Eddy Campbell’s salary upwards by $30,000 as an estimate for what his bonus may be since that is missing from his datum, then adjusting each salary to be in 2011 dollars and including the number of students as provided by CAUBO, we can see the distribution of university presidents’ salaries (click for a larger version):

So what happens when you take the amount of each of the median salaries from before, and compare them to their respective president’s salary, all in 2002 dollars? This:

Full professors:

Associate professors:

Assistant professors:

UNB comes out directly in the middle of the pack. A completely unexpected, but interesting, result. And even more surprising is that almost all professors’ salaries have been increasing _relative to their president’s_ for a decade. (_EDIT: This last sentence is not necessarily true - see note above. With existing data, I can only compare recent years 2006-2010._)

If you enjoyed this, also see my analysis of UNB’s operating budget.

The UNB Strike, by the Numbers - UNB Operating Grant

2014-01-14T00:00:00+00:00

I know people on all sides of the UNB strike debate personally, and I respect them all a great deal. But until things got to a point where the professors at UNB were going on strike, I wasn’t really paying attention to the debate all that much. When I started to dig into the numbers that were publicly available, I found them lacking context and clarity. So, I’ve decided to do them myself.

This first post is about the UNB Operating grant, or the ‘chequing account’ as I’ve heard it referred to. Basically, it’s the fund of money that UNB has at its disposal to pay for things like salaries, benefits, library costs, maintenance and other such things. It’s largely made up of two large components - provincial government funding and tuition/fees - and a small bit of other revenue.

I’ve heard it said that the government funding portion of this account has gone up and down for the past few years, and this is true in a sense, but it only makes sense if you’re familiar with the difference between ‘real’ and ‘nominal’ dollars. In nominal dollars - that is, the actual number you’d see in a financial report each year - it has gone up consistently, after being stuck on a gentle slope for a while:

(Source: CAUBO: Financial Information of Universities and Colleges. Note: the 2012 figure is an estimate from a graph provided by the UNB Comptroller in their 2012-2013 report, as these numbers are not available publicly through CAUBO yet. If anybody has access to better information for the past few years, let me know. )

In ‘real’ terms, however, things look a bit different. Real refers to the funding amount with inflation accounted for. Things generally cost more each year than the year before, and so putting them all in terms of a single reference year helps to compare different years. In 2002 dollars, the graph looks like this:
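For anyone who wants to see the arithmetic, here is a minimal Python sketch of converting nominal dollars to a reference year using a price index. The funding amounts and index values are invented for illustration; a real analysis would use published CPI figures.

```python
def to_real_dollars(nominal, cpi_for_year, cpi_for_base_year):
    """Convert a nominal amount into base-year ('real') dollars using a price index."""
    return nominal * cpi_for_base_year / cpi_for_year


# Hypothetical price index with the base year set to 100
CPI_BASE = 100.0
cpi = {2008: 110.0, 2009: 112.2}                 # made-up index values
nominal_funding = {2008: 100.0, 2009: 101.0}     # made-up funding, in $ millions

real_funding = {
    year: to_real_dollars(amount, cpi[year], CPI_BASE)
    for year, amount in nominal_funding.items()
}
```

In this made-up example the nominal figure rises from 2008 to 2009, but prices rise faster, so the real figure falls. That is the kind of back-sliding a nominal chart hides.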

(Source: CAUBO: Financial Information of Universities and Colleges)

Here we see that from 1999-2004, the government actually didn’t increase the funding to UNB at all. Since then, it has gone up rapidly, though in some years it went in reverse in real dollars. So, even though the government gave more ‘nominal’ dollars to UNB in 2009 than it did in 2008, what the university could purchase with those dollars decreased. That being said, UNB has much more funding today than it did in 2004, to the tune of 28.8%.

Looking at the growth rates of this fund, in nominal terms, it looks like it has grown steadily:

(Source: CAUBO: Financial Information of Universities and Colleges)

While in real terms, though the growth has been good, there has been some back-sliding:

(Source: CAUBO: Financial Information of Universities and Colleges)

How do these amounts compare to the amount funded by UNB tuition and ancillary fees? The following, also from CAUBO’s figures, shows the trends for what portion of the total operating budget is covered by tuition and what portion is covered by government funding (if the numbers don’t add up to 100%, it’s because there is another small bit lumped under ‘other.’):

(Source: CAUBO: Financial Information of Universities and Colleges)

In the next post I’ll be taking a closer look at the salary figures that have been released and place them in context of other similar institutions.

Disclaimer: As mentioned above, I know members of both the UNB administration and faculty, and consider many to be friends. I am not being compensated monetarily or in any other way to produce this analysis by either side. All of the scripts I’ve used to create these are available for scrutiny on my GitHub page (for those that understand R, that is - sorry, the data was too complicated for Excel): https://github.com/Brideau/UNBStrikeWatch

The Small Towns and Deserts of Canada

2013-12-22T00:00:00+00:00

I grew up in a small town in northern New Brunswick, have always had an affinity for ‘medium sized’ cities over big cities, and spend a lot of time thinking about what makes areas with high population density different from places like the Maritimes with so few people. When I came across Nathan Yau’s line maps about food deserts, I wondered if the same approach could be generalized for any country, and at any scale.

I managed to create this (click any image to see a huge PDF) by pulling tens-of-thousands of data points from Google Places:

Well…that’s interesting. But what is it? Well, what this map shows you is how far you’d have to travel to a hospital if you found yourself at the end of any of those lines and in trouble - the black dots being hospitals or medical clinics. Unfortunately, this doesn’t really make any sense, because how often would you find yourself in the remote parts of NWT?

Given that fact, I set about modifying the script so that instead of doing a scan every 30km across the country in a grid, it would load a list of every small town in Canada (thank you Geocoder.ca). Essentially, this says “if you grew up in town x, how close were you to medical care if you needed it?”:
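The core of this kind of map is a nearest-neighbour lookup: for each town, find the closest facility and draw a line to it. Here is a minimal Python sketch of that step using straight-line (great-circle) distance. It is an illustration rather than the actual script, and the function names and coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_facility(town, facilities):
    """Return (distance_km, facility) for the facility closest to the town."""
    return min(
        ((haversine_km(town[0], town[1], f[0], f[1]), f) for f in facilities),
        key=lambda pair: pair[0],
    )


# Hypothetical town and clinic coordinates as (latitude, longitude)
town = (46.0, -66.0)
clinics = [(47.0, -66.0), (46.1, -66.0), (45.0, -64.0)]
distance_km, closest = nearest_facility(town, clinics)
```

Each line on the map would then run from `town` to `closest`. A brute-force scan like this is fine for a few thousand towns; a spatial index would be the natural next step for larger point sets.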

I have to admit I was pretty shocked when I saw this. Though there’s lots of talk about doctor shortages - and for good reason - with a few exceptions in the north, there’s pretty much a medical clinic close by if you ever need it. I went a step further and did a town-by-town analysis for New Brunswick as well:

This of course doesn’t tell the whole story, such as whether the right kind of doctor is at the right place when you need them, but this general approach could make for an interesting way to present data if more time/care was taken to look at a very specific case.

Now what about other things, like cultural attractions? For example, I had never been exposed to art growing up in my home town. As it turns out, there was good reason for it:

All those red lines pointing toward Miramichi mean that I was essentially in a complete art desert for most of my life. Contrast this to someone from southern Ontario where they practically have art farms:

Small-town Saskatchewan, Alberta and Manitoba, as it turns out, aren’t in a much better situation than NB:

My next approach was to see if I could get this to work on a very small scale - the size of a city like Fredericton. Initial attempts to use a uniform grid to scan the city and draw lines to the nearest locations didn’t work out because they simply didn’t make much sense at that scale (nobody lives in O’Dell park, so what does it matter how far it is from a grocery store?). So this time, I modified the script to read in the coordinates for every building in NB, filtering it for buildings in the city. My first graph shows the ‘as the crow flies’ distance from places to the nearest grocery store, giving a feel for how much car travel is required just for daily errands:

I wasn’t very happy with how this one turned out however, as a quick scan found a couple of suspect points from my Google Places API grab. Some of those black dots don’t correspond to places where a person could actually get groceries. At the small scale, you really are at the whims of the quality of data Google has. It would also be better to do this by driving distance vs. a straight-line, and maybe I’ll try that in the future.

To try something a bit different, I grabbed the recycling bin locations from Fredericton’s open data catalogue (this one had a whopping 4 data points) and did a similar analysis showing how far most homes and buildings are from the bin locations:

Now before you jump on me, I know that many places have pickup. But if you happen to be living in an apartment or multi-family dwelling (two places inhabited more often than not by people with lesser means) and have ever tried to go to one of these things, maybe this can shed a light on why they are always such a mess:

There’s lots more that can be done using this approach, and I’d love to hear people’s ideas. I’ve put all of the code up on GitHub, though it might be a bit hairy to go through since it’s in a bunch of different files: https://github.com/Brideau/LonelyPlaces/

Mapping New Brunswick With Only Properties

2013-12-03T00:00:00+00:00

One of my first introductions to data analysis and visualization was a map created by an MIT student that plotted every citizen in the United States. No other feature was present: no roads, no rivers - nothing but people - yet the contours of the country are immediately apparent. It’s gorgeous, and is an impressive feat.

If you look up close, you see that the points are randomly scattered, but at a distance, you can’t notice that fine a level of detail. It occurred to me that the same kind of macro-look could be accomplished with a simple plot of property geolocations. Luckily, GeoNB has exactly that dataset available: http://www.snb.ca/geonb1/e/DC/catalogue-E.asp

After that, it’s as simple as using open-source GIS software QGIS, tweaking a few visual parameters, and you have an amazing black and white version of your own.

EDIT: Fun fact: creating this same plot for Nova Scotia (using about 560,000 data points) would have cost me $8960.
