Resources for check-in #3 - Cleaning Data


#1

Previous week: Resources for check-in #2 - Checking Data
Next week: Resources for workshop 2

Homework

  • Read the handout - Cleaning Data: Part 1
  • Read part 2 of the handout if you’d like to try out cleaning data - Cleaning Data: Part 2
  • Read about combining data if you’d like to try it out - Combining Data
  • Reply to this thread with your reflection and:
    • Add your final refined question or part of your question that you have the data to answer below.
    • Add links to the data you’ve found - the form isn’t important, pdf, spreadsheets etc.

Resources for workshop #2
#2

I’ve updated the handout Cleaning Data: Part 1. It was linked to the slides, it’s now linked to the pdf.


#3

Hi Edafe,

So I have been looking at my data to think about cleaning it and just realised something! So the NHS Digital data I have been looking at (https://digital.nhs.uk/data-and-information/publications/statistical/adult-social-care-activity-and-finance-report/adult-social-care-activity-and-finance-report-england-2016-17) is the reference data tables (table 15 in particular).

This is obviously not the raw data, but is using some of the CSV data. However can I just copy some of the data from the table into a new spreadsheet because it is not easy to use the CSV data because it is in a code for different support types etc. and not easy to understand.

I hope this makes sense.

Thanks, Ethny


#4

Hi Edafe,

I have spent the last couple of days catching up with my data homework. My refined question is “Has the number of user led organisations decreased since 2010 as a result of reductions in funding from local or central government?”. I still think the last bit needs some thinking. I have found quite a lot of data sources, which I need to properly look at between now and Tuesday, but the one I would really like to look at is from the Charity Commission, http://data.charitycommission.gov.uk/, but it needs an SQL server, can you help me out with access to it? I would like the earliest and latest sets.

Also if anyone knows a decent source on local authority grants to groups let me know.

Thanks!


#5

Hi Edafe,

I’ve combined my grantnav data with the England, Scotland, Wales and Northern Ireland deprivation data. I’ve put each country in seperate columns as it’s from different data sources.

The date issue that I mentioned in the last check in looks like this: 2016-09-13T00:00:00+00:00.

Looking forward to Tuesday.
Thanks, Abi


#6

Hi Abi and Edafe,

I am also looking to combine my grantnav data with the England deprivation data, but haven’t got round to it yet.

Not sure what the data issue you are mentioning is Abi but I also have that format of date in one of my data files.

I can’t make it to London next week as I’m on leave for 2 weeks but will catch up when I’m back! Enjoy.

Bethan


#7

Hi John

Excellent stuff! The Charity Commission data is pretty large and contains quite a few different tables. Here’s a list of the data definitions (google drive) and the Charity Commission data definitions website.

I’ve added the two main data tables to the Charity Commission folder on Google drive for download. They are chunky files, so let me know if you have any trouble opening them or you would like them combined with other data.

Local authority grants to groups are part of the Local government transparency code 2015. As far as I know, they aren’t split off in any way. You can either find a separate details of grants paid to voluntary, community and social enterprise organisations or grant information can be included in the council’s spend data - not a great situation at the moment.

NCVO Almanac 2018 is a useful source for estimated income but it’s best used for a national picture as the sampling means it gets less useful for a regional or local picture.


#8

Hi Abi

That’s brilliant work, looking forward to seeing your combined data.

Here’s how to fix the date issue: Excel - How to convert UTC date time - alternatively, send me your data and I’ll fix the dates.

All the best

Edafe


#9

Hello! My (somewhat) refined question is: Is our London-wide hardship fund reaching the right parts of London, i.e. where people most need help? To answer this, I want to compare the indices of deprivation with the location data I have for our hardship fund. I was also hoping to compare it with other hardship funds by searching GrantNav, but I couldn’t find any on there. Might be searching it wrong… My own data doesn’t drill down any further than borough level, unfortunately. Edafe, I will email you my own data now in case you want to have a quick peek at it before 12 o’clock.
Alison


#10

Hi all,
A few notes about my search so far:

  1. Exported all grants that came up with the term ‘music’ in them through grant nav
  2. Removed all w/ award date pre-2015 (6,209 records)
  3. Removed all outside of England (on the region layer)
  4. Verified that they were in fact music - initially I searched for the term ‘music’ in organisation name, title and description - but manually checking these results it seemed that not all music related grants used the words music (e.g. words like opera, sing, choir, orchestra, tune, sinfonia, compose…etc. were not all used in conjunction with the term music and so weren’t being caught by top level search.
  5. With a bit of digging, I was able to come up with a clunky search system that looked a little like this:

=IF(SUMPRODUCT(–ISNUMBER(SEARCH(B2:B9,A1))),1,0)

(I made a seperate tab called ‘search terms’, fixed the paramters with $'s and then inserted rows in between to add new terms).

  1. As not all grants had regional data, and I had to manually remove some grants with no data that appear to be in Scotland/Wales/NI or abroad. I found doing this by grant body was easiest (e.g. Robertson Trust seem to only grant in Scotland) but had to inspect manually through the remaining records.

  2. This left me with around 1,320 grants of which 83% (1,099) had district data - which was not bad.

  3. I cross tabulated this with IMD district data - however I noticed that the dataset I had didn’t align with the Grant Nav terms (most likely because it was a set I had cleaned before to align with our database terms but can’t recall!) so these had to be aligned before using vlookup to idnetify the proportion in 20% most deprived local authority districts.

  4. It looks like the total investment related to music in England since 2015 was £45,208,001. I found that £8,939,617 (19.8%) of funds allocated since 2015 went into the 20% most deprived LA districts.

|1|£8,938,617||19.8%|
|2|14863038.09||32.9%|
|3|4583643.36||10.1%|
|4|3675877.09||8.1%|
|5|£8,169,423||18.1%|
|unaccounted (regional)|£4,978,001||11.0%|

I’m fairly sure this could be improved on and done in a more efficient way - so sharing my method for feedback!


#11

Another thing I forgot to mention - this data was based on the location of the recipient organisation (not where the beneficiaries were). Data of where the beneficiaries were located was much more patchy, so I worked with organisation location in the first instance. I did consider if it might be worth contrasting organisation/beneficiary location where there was data to see if I could get an indication of how this changes - but ran out of time!


#12

Thanks Alison. Lovely to catch up with you yesterday. Please do send the file over and I’ll take a look before our next check-in.


#13

Great work Nick!

I took a look over your steps, here’s how you could use GrantNav’s advanced search to do some of the heavy lifting:

  1. Get all grants with music - this is a good starting point, on the main page search for:

    music

When I ran this search on 9 July 2018, I got 8,165 GBP grants and 0 non-GBP grants

  1. Now we limit the grants to after 2015 using the advanced search for

    music AND awardDate:[2015-01-01 TO 2018-07-09]

Total GBP grants: 1,954
Total Non-GBP grants: 0

  1. Now we want to limit to England. As you found, not all grants have regions but we can exclude the regions we don’t want using the advanced search
    music AND awardDate:[2015-01-01 TO 2018-07-09] AND NOT recipientRegionName:Scotland AND NOT recipientRegionName:Wales AND NOT recipientRegionName:“Northern Ireland”

I can still see entries with Scotland, so let’s try excluding Scotland, Wales and Northern Ireland from any fields

music AND awardDate:[2015-01-01 TO 2018-07-09] AND NOT Scotland AND NOT Wales AND NOT “Northern Ireland”

Total GBP grants: 1,581
Total Non-GBP grants: 0

  1. Now to add the words associated with music - this is hard because you really need to know which words to use - I stuck with your list opera, sing, choir, orchestra, tune, sinfonia, compose and added them as an alternative to music

    (music OR opera OR sing OR choir OR orchestra OR tune OR sinfonia OR compose) AND awardDate:[2015-01-01 TO 2018-07-09] AND NOT Scotland AND NOT Wales AND NOT “Northern Ireland”

Total GBP grants: 1,889
Total Non-GBP grants: 0

  1. I stopped here and searched the csv for regions we didn’t want. Despite the exclusions below, I removed grants by scottish funders where Funding Org:Identifier starts with ‘GB-SC’ and removed grants with the word ‘Welsh’

Original: 1,889
Remove GB-SC: 1,857
Remove Welsh: 1,856

I used the beneficiary location as a proxy for Recipient District Geographic Code - first, replacing LSOA level codes E01 with district codes: E06, EO7, E08, E09 and E10. Then copying over the matched district levels.

We’re left with 226 of 1856 without any geographic codes at all. We’re now ready to merge with the local authority level indices (results are in grantnav-20180709165922-clean.csv). To merge with local authority level indices, we group by year, currency and local authority code (you could skip the year and currency, the only real requirement is the local authority code). The final file is grantnav-20180709165922-clean-grouped.csv - it includes the grants which have no geography, so can filter those out.

PS If you decide to group the grant years, remember you can’t add up the indices - try min or max instead of sum or avg.

You can find all the files here: Music Grants Data

I’ll take a look at your spreadsheet and get back to you on that!


#14

Hi Bethan

Hope you had a great break!

I’ve been looking at the English indices at LSOA level which you’re trying to match to grantnav data.

Unfortunately, the two can’t be combined at Ward / District level.

This is because the ranking is done at LSOA level for all of England. Ranking by ward isn’t currently supported as wards tend to vary in size, so you wouldn’t be comparing like with like.

There’s a couple of workarounds:

  1. Use the local authority level ranking
  2. See if you can download ward level data from http://www.localhealth.org.uk/#l=en;v=map4 - it’s not been quality checked

Let me know what you’d like to do next.


#15

Hi Alison

Thanks so much for the data, I’ve split the boroughs and matched them up with London’s indices of multiple deprivation at borough level.

In the Hardship fund folder, you’ll find 3 files:

In the Hardship Fund spreadsheet, there are a few things to note:

  • The ‘Main’ sheet includes all grants in your original file. The only column missing is the boroughs.

  • The ‘Indices’ sheet includes all indices of multiple deprivation for London boroughs.

  • The ‘Boroughs’ sheet includes all boroughs in the 'Main boroughs(s) informed choice ’ column of your original file. Here, the boroughs have been split up, so a single borough is shown on each row. The ‘RowNumber’ column helps you link them back to the ‘Main’ sheet.

  • The ‘Combined’ sheet combines all the information in the ‘Main’, ‘Indices’ and ‘Borough’ sheets in one place. Note that there are grants missing indices - these are either outside London boroughs according to the London datastore or are London-wide. I cleaned up a few names to make sure the boroughs matched and split each grant amount pro-rata over all boroughs listed.

  • The ‘Grouped’ sheet is the one to use - it correctly combines grants for each borough with the indices for that borough. Note that the London-wide and outside London counties are excluded. Also note that only the grant amounts are summed up per borough - the indices are not because the information is already at borough level.

Let me know if you find anything odd or missing.


#16

Hi Edafe

This is amazing, thank you. I’m looking at the ‘grouped’ sheet. Can you give me a quick idiot’s guide to what the pro-rata grant amount column is, i.e. what it contains? I don’t get it…

Thanks!

Alison


#17

Hi Alison

The prorata grant amount is the total grant amount divided amongst all the boroughs reported.

For example, a grant of £10,000 is given to organisation XYZ. They report 4 London boroughs: Hackney, Westminster, Fulham and Chelsea.

Each borough is pro-rated £2,500 for that grant.

When we come to work out how much money each borough got, we can add up the pro-rata amount and group by the borough.

Does that make sense?


#18

Yes, I think so. So for Barking and Dagenham, where it says 4928.678093, that means £4928. - followed by a lot of decimal points?


#19

Yes, that’s right. The decimal places are down to dividing the grant money up. You can’t round up if you prefer.