A code list is a type of controlled vocabulary: a finite list of codes and meanings that represent the only allowed values for a particular data item. At present the 360Giving standard uses a codelist only for Geographic Code Type (see the codelist of geographic code types here).
Before we dive into those suggestions, I believe we need to decide if we want to use codelists in general or not. How will we introduce them and how will it affect the data that we currently have?
What are the pros for having a codelist? What are the risks?
I'm interested to learn what codelists IATI and Open Contracting use, as we should take that into consideration and try to complement them where we can. I think using a codelist for currencies seems obvious, and I would expect us to do what other standards do. We may work mainly with UK-registered grantmakers, but they fund in lots of different currencies. We shouldn't over-complicate this, but we should align ourselves with best practice in an effort to make the different standards interoperable where we can.
The upside of controlled vocabularies is much more meaningful, analysable data. Instead of text in a field, you have a value with semantics. The downside is you force people to sometimes make hard choices about which value to pick.
There is a middle-ground option which is sometimes valuable: free text with recommended values. You provide a list of recommended values, but allow new values to be used for unforeseen situations, rather like a folksonomy. Ideally you also have a mechanism for collecting those new values and bringing them into the standard. This is basically how OpenStreetMap tagging works, and mostly it works OK.
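As a minimal sketch of that middle ground (the field and recommended values here are entirely hypothetical, not part of 360Giving): recommended values pass silently, while anything else is still accepted but flagged so new values can be collected and considered for the standard.

```python
# Hypothetical recommended values for an imaginary "grant type" field.
RECOMMENDED_GRANT_TYPES = {"capital", "revenue", "unrestricted"}

def check_value(value, recommended=RECOMMENDED_GRANT_TYPES):
    """Free text is always valid; unrecognised values get a warning."""
    if value.lower() in recommended:
        return True, None
    return True, f"'{value}' is not a recommended value; flagged for review"

ok, warning = check_value("seed funding")
assert ok  # never rejected - this is the point of the middle ground
```

The design choice is that validation never fails, so publishers are not blocked, but the warnings give the standard's maintainers a feed of candidate values.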
Where possible, I think you use existing standards (like currency codes, country codes, etc).
(1) Using established code lists (e.g. currency) makes sense. The key issues that arise here are:
Synchronisation - making sure we have the latest version of the codelist
Deprecation - handling the fact that codes can go out of use, but that we don't want to suddenly have data that doesn't validate because a value has disappeared from a codelist
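One common way to handle the deprecation issue above is to keep withdrawn codes in the codelist with a withdrawal date, so historic data still validates. A sketch (the withdrawal date here is illustrative, not taken from ISO 4217):

```python
from datetime import date

# Tiny illustrative codelist: withdrawn codes are kept, with a date.
CODELIST = {
    "NLG": {"withdrawn": date(2002, 3, 17)},  # Dutch guilder, replaced by EUR
    "EUR": {"withdrawn": None},
}

def code_valid(code, published):
    """A code is valid if it exists and was not yet withdrawn
    when the data was published."""
    entry = CODELIST.get(code)
    if entry is None:
        return False
    withdrawn = entry["withdrawn"]
    return withdrawn is None or published < withdrawn

assert code_valid("NLG", date(2000, 1, 1))      # historic data still valid
assert not code_valid("NLG", date(2010, 1, 1))  # but not for new data
```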
Ideally, there would be good external sources to refer to for most codelists (e.g. currency lists), but in practice groups like ISO don't maintain really nice free and open places to point to.
The Frictionless Data project does maintain some useful data packages that could serve as codelists, though, and we could consider co-maintaining codelists there for currencies, countries etc.
(2) Codelists we have to create
When these are 'open' lists (i.e. adding a new value doesn't dilute or change the meaning of the other values), then the main issue is just the amount of technical work required to maintain a list, and the research needed to keep it updated and accurate.
When these are 'closed' lists (i.e. trying to categorise the world into a limited set of buckets, such that a new category changes the others), then there is much more work to be done negotiating meanings - and the challenge is political more than it is technical.
I understand OCDS and IATI have tried to keep the number of lists like this they maintain to a minimum, only creating them when there is a real user need.
A third issue is when codes remain the same, but the boundaries they represent change, and it's ambiguous which version of a boundary is referred to in the data. This is pretty significant with electoral boundaries (at least in Australia).
The solution to both that and the deprecation issue is to include a version identifier. So it's not just 'Local government area', it's 'Local government area 2015' or whatever.
I worked on a standard for Australian regional boundaries, csv-geo-au, that does that. You can distinguish between different versions of electoral boundaries with CED_2011, CED_2016 etc.
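The versioned-code idea can be sketched very simply: the identifier carries both the boundary type and the vintage, so a consumer can always tell which edition of the boundaries is meant. (The parser below is my own illustration of the `CED_2011` / `CED_2016` pattern, not code from csv-geo-au.)

```python
import re

def parse_versioned_code(column):
    """Split a versioned boundary code like 'CED_2016' into its
    boundary type and the year of the boundary edition."""
    m = re.fullmatch(r"([A-Z]+)_(\d{4})", column)
    if not m:
        raise ValueError(f"not a versioned code column: {column}")
    return {"boundary_type": m.group(1), "year": int(m.group(2))}

assert parse_versioned_code("CED_2016") == {"boundary_type": "CED", "year": 2016}
```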
In the Australian context we are lucky that the Australian Bureau of Statistics has done all the hard work with the Australian Statistical Geography Standard, maintaining excellent sets of identifiers which are updated on a regular (though infrequent) basis. I don't know what the global equivalent would be exactly - probably nothing as authoritative.
You asked about risks in the original post. One that hasnât been mentioned so far is the potential barrier code lists could create for getting people publishing to 360Giving.
Code lists are essentially standards within a standard, so if an organisation has already opted for a codelist (or no codelist) and 360 specifies one, it could be harder/more time consuming to get the data in shape or to convince a new organisation to take part.
It's often not as simple as picking, e.g., an ISO code for country - there are both 2- and 3-letter codes. So whilst it may seem on the surface that there are obvious candidates, there are nuances. And as mentioned already, the debates aren't just technical. Organisations may have opted for a different code list for very valid reasons. Mapping might then be required from one code list to another.
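To make the 2- vs 3-letter point concrete, mapping between the two ISO 3166-1 forms is mechanical but still has to be done somewhere. A tiny illustration (three sample countries, not the full list):

```python
# Sample mapping from ISO 3166-1 alpha-2 to alpha-3 country codes.
ALPHA2_TO_ALPHA3 = {"GB": "GBR", "AU": "AUS", "NZ": "NZL"}

def to_alpha3(code):
    """Convert a 2-letter country code to its 3-letter equivalent."""
    return ALPHA2_TO_ALPHA3[code.upper()]

assert to_alpha3("gb") == "GBR"
```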
That's not to say there aren't benefits to making this data comparable by standardising, but in some cases it might be best to make a recommendation on a good list to use but not to insist on a particular one.
A lower barrier for publishers, which could still help data re-users to interpret the data, is to ask that metadata associated with the dataset declares the code lists that are used for relevant fields.
Exactly, yes - that's the approach IATI goes for with a number of fields, by way of vocabularies.
For instance, there's a default codelist for sector, but publishers aren't obliged to use it. Instead, the publisher can choose a different vocabulary from this list, and use a sector code from the chosen vocabulary. It's also possible to indicate a custom vocabulary (99), specify a URI for it, and then use codes from that. So it's super flexible.
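Sketched as data (this is an illustration of the pattern, not actual IATI XML, and the custom-vocabulary URI is made up): each sector entry declares which vocabulary its code comes from, and vocabulary "99" additionally carries a URI identifying the custom list.

```python
sectors = [
    {"vocabulary": "1", "code": "11110"},   # code from a default list
    {"vocabulary": "99", "code": "ARTS-01",  # hypothetical custom code
     "vocabulary_uri": "https://example.org/my-codelist"},
]

def describe(sector):
    """Show how a consumer has to branch on the declared vocabulary."""
    if sector["vocabulary"] == "99":
        return f"custom code {sector['code']} from {sector['vocabulary_uri']}"
    return f"code {sector['code']} from vocabulary {sector['vocabulary']}"
```

Note how even this toy consumer needs vocabulary-specific logic - which is exactly the see-saw described in the next post.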
One thing I would warn about on this (though it's probably a warning that has been made elsewhere!) is that there are two ends to this see-saw! Whenever you make things a little easier or more flexible for the person publishing the data, you also make things a little more complicated for anyone directly consuming the data.