2015 UK General Election Normalising Data

From SQLZoo
Jump to navigation Jump to search

The unnormalised data from the CSV file looks like this:

Notice that it includes redundancy - constituency names are repeated for example. We can see that link between the ons_id and the constituency_name is repeated over and over.

Pick a primary key

Our first job is to identify a primary key. There are several options to consider:

  • First we notice that there is no single column that is unique and so we will need to use at least a pair of columns.
  • (firstname,surname) would be good
    • The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
    • Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as
      SELECT firstname,surname COUNT(1) FROM ge HAVING COUNT(1)>1
  • The combination (ons_id,party) is also tempting
    • No party will put up two candidates in the same constituency, that would be self defeating and against the rules.
    • Unfortunately there are independent candidates with a NULL party and we cannot have NULL values in the primary key
  • It turns out that the triple (ons_id,firstname,surname) is unique. You may not have more than one candidate in a constituency with the same first name and surname. This would be confusing for voters. We can verify that this is a safe choice with a query such as
    SELECT firstname,surname,ons_id FROM ge GROUP BY firstname,surname,ons_id HAVING COUNT(1)>1;

Identifying dependencies

The columns headins are:

ons_id	ons_region_id	constituency_name	county_name	region_name	country_name	constituency_type	party_name	party_abbreviation	firstname	surname	gender	sitting_mp	former_mp	votes	share	change

Having decided on our primary key as (ons_id, firstname, surname) we notice the following dependencies:

ons_id                   -> ons_region_id
ons_id                   -> consitituency_name
ons_id                   -> county_name
ons_id                   -> constituency_type
county_name              -> ons_region_id
ons_region_id            -> region_name
ons_region_id            -> country_name
party_abbreviation       -> party_name
ons_id,firstname,surname -> gender
ons_id,firstname,surname -> party_abbreviation
ons_id,firstname,surname -> sitting_mp
ons_id,firstname,surname -> former_mp
ons_id,firstname,surname -> votes
ons_id,firstname,surname -> share
ons_id,firstname,surname -> change

Decide on tables

Each distinct determiner (the left hand side of the -> above) will be a table. The determiner will be the primary key in each case.

Bold indicates a primary key, italics indicates a foreign key.

  • constituency(ons_id, constituency, county_name, constituency_type)
  • county(county_name, ons_region_id)
  • region(ons_region_id, region_name, country_name)
  • party(party_id, party_name)
  • candidate(ons_id, firstname, surname, gender, party_id, sitting_mp, former_mp, votes, share, change)

The Entity Relationship Diagram for this database is:

Implement the tables

We need to start with the tables that do not have out-going foreign keys.

party and region are simple tables

Party:

CREATE TABLE party(
  party_id VARCHAR(50) PRIMARY KEY,
  party_name VARCHAR(50)
)

Region:

CREATE TABLE region(
  ons_region_id VARCHAR(10) PRIMARY KEY,
  region_name VARCHAR(50),
  country_name VARCHAR(50)
)

county has a reference to region

Now that we have the foreign key target in place we can introduce county which refers to ons_region_id

CREATE TABLE county(
  county_name VARCHAR(50) PRIMARY KEY,
  ons_region_id VARCHAR(10) NOT NULL,
  FOREIGN KEY (ons_region_id) REFERENCES region(ons_region_id)
)

Notice

  • we take care to indicate NOT NULL on the foreign key - this is because the relationship is not optional. Every county MUST HAVE a region.

constituency references county

And now constituency which references county_name:

CREATE TABLE constituency(
  ons_id VARCHAR(10) PRIMARY KEY,
  constituency VARCHAR(50) NOT NULL UNIQUE,
  county_name VARCHAR(50) NOT NULL,
  constituency_type VARCHAR(10)
                   NOT NULL
                   CHECK (constituency_type IN ('county','borough')),
  FOREIGN KEY (county_name) REFERENCES county(county_name)
)

Notice:

  • There are exactly two constituency types, roughly county is in the countryside; borough is in the city.
  • constituency names must be unique

candidate is the complicated table

And finally candidate:

CREATE TABLE candidate(
  ons_id VARCHAR(10),
  firstname VARCHAR(50),
  surname VARCHAR(50),
  gender VARCHAR(10) NOT NULL,
  party_id VARCHAR(50) NULL,
  sitting_mp VARCHAR(3) NOT NULL CHECK (sitting_mp IN ('Yes','No')),
  former_mp VARCHAR(3) NOT NULL CHECK (former_mp IN ('Yes','No')),
  votes INT  NOT NULL,
  share FLOAT NOT NULL,
  `change` FLOAT NULL,
  PRIMARY KEY (ons_id,firstname,surname),
  FOREIGN KEY (ons_id) REFERENCES constituency(ons_id),
  FOREIGN KEY (party_id) REFERENCES party(party_id)
)

Note

  • change is a reserved word and must be enclosed in back-ticks
  • party may be NULL. The relationship between candidate and party is optional
  • change may be NULL - it gives the change in the vote for a party and only applies if that party had a candidate in the most recent general election before 2015.

DROP TABLES

If you need to start again you must drop the tables in the reverse order. Last created is the first dropped.

DON'T DROP THE TABLES!

DROP TABLE candidate;
DROP TABLE constituency;
DROP TABLE county;
DROP TABLE region;
DROP TABLE party;

Add the region data

INSERT INTO region
  SELECT DISTINCT ons_region_id, region_name, country_name
    FROM ge

Add the party data

Annoyingly the abbreviation Lab has been used for two different parties "Labour" and "Labour and Co-operative". We are going to ignore the distinction between these parties and treat them as one.

In doing this we are losing data and in general that should not be done. There is some justification in this case. When is comes to calculating who has a majority in parliament the "Labour and Co-operative" party seats count towards "Labour". There are other alliances (mostly in Northern Ireland) where parties are automatically aligned to mainland parties but these are a little more complicated.

That means we cannot use the DISTINCT method. Using MIN will arbitrarily assign the first value (alphabetically) to the pesky "Labour and Co-operative" party members.

INSERT INTO party
  SELECT party_abbreviation, MIN(party_name)
    FROM ge
   GROUP BY party_abbreviation;

INSERT county data

INSERT INTO county
  SELECT DISTINCT county_name,ons_region_id
    FROM ge;

INSERT constituencies

INSERT INTO constituency
  SELECT DISTINCT ons_id, constituency_name, county_name, constituency_type
    FROM ge;

INSERT candidates

INSERT INTO candidate
  SELECT ons_id, firstname, surname, gender, NULLIF(party_abbreviation,'Ind'),
         sitting_mp,former_mp,votes,share,NULLIF(`change`,'')
    FROM ge;

The function NULLIF gives NULL if the arguments match.

That means that the party abbreviation will be put in place unless the party abbreviation is 'Ind'. Note that this is an important distinction; there may be many Ind candidates in one constituency - they are not in the same party.

DROP the ge table

With reckless abandon we can now drop the flat, unnormalised table ge. It has served its purpose.

DROP TABLE ge;

What could possibly go wrong?

DataWars, Data Science Practice Projects - LogoDataWars: Practice Data Science/Analysis with +100 Real Life Projects