The Cricsheet Register
What is the register?
The Cricsheet Register consists of multiple CSV files containing a unique identifier for each of 13,817 people, along with 22,164 identifiers from 9 sources, all linked together so that an id from one site can be used to find the identifier for the same person at another site. The data also contains 8,378 name variations for the people covered.
The people currently covered by the data include:
- Every player and official who appears in a match for which Cricsheet provides data.
- Every player who appears in a match covered by the Cricket Scorecard Accuracy Project.
- A smattering of random players encountered while fixing some incorrect data on matches Cricsheet covers.
Bear in mind that not all ids are available for each person in the data. Only those required by either Cricsheet itself or the Cricket Scorecard Accuracy Project are provided.
What sites are covered?
The data includes identifiers from 9 different sites. They, along with the number of identifiers from each, are:
- BCCI (1,329)
- Big Bash (296)
- Cricbuzz (22)
- CricHQ (24)
- Cricingif (130)
- CricketArchive (4,417)
- ESPNcricinfo (13,855)
- Opta (68)
- Pulse (2,023)
Where is the Register currently used?
The data is currently used within 2 projects:
- Cricsheet itself uses the data in order to recognise and map players and officials when processing various data sources for matches, and is gradually adding the Cricsheet identifier from the Register to the match data. Most of the initial work of collecting ids was done to make Cricsheet work easier.
- The Cricket Scorecard Accuracy Project (CSAP) uses Register data to check whether the players mentioned in the different scorecards for a match are actually the same person. CSAP provided a valuable push in getting this project into a releasable state, as I suddenly had a second case where I needed the mapping data, and also had to collect, and check, many more mappings in order for the project to work.
What data is available?
The people in the Register can be found in the file people.csv.
This file is a CSV with a single row for each person. Each row contains the id for the person (as assigned by Cricsheet and used in Cricsheet data), their name, a unique version of their name (as used in the match data files), and columns for the various identifiers used for the person. Not all ids are provided for all people; only those determined so far through use in Cricsheet or the Cricket Scorecard Accuracy Project are included.
The fields on each row (after the header) include, but are not limited to, the following:
- The Cricsheet identifier for the person, as used in
- The name used for the person in Cricsheet data. This can (and will) be used for multiple people of the same name
- The unique name used for the person in Cricsheet data. Guaranteed not to be used for another person
- The person's identifier on BCCI
- The person's identifier on Big Bash
- The person's identifier on Cricbuzz
- The person's identifier on CricHQ
- The person's identifier on Cricingif
- The person's identifier on CricketArchive
- The person's identifier on ESPNcricinfo
- The person's identifier on Opta
- The person's identifier on Pulse
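As a rough illustration of how the linked identifiers might be used, the sketch below translates an id from one site into the id for the same person on another site. The column names (`identifier`, `key_cricinfo`, `key_cricketarchive`) and the sample rows are assumptions made up for this example, so check the header row of the real people.csv before relying on them. The helpers `build_index` and `translate_id` are hypothetical and not part of any Cricsheet tooling.

```python
import csv
import io

# Made-up sample in an assumed layout. The column names are assumptions
# for illustration only; check the header of the real people.csv.
SAMPLE_PEOPLE_CSV = """identifier,name,unique_name,key_cricinfo,key_cricketarchive
a1b2c3d4,J Smith,J Smith (1),1001,
e5f6a7b8,J Smith,J Smith (2),1002,2002
"""


def build_index(rows, source_column):
    """Map each non-empty id in source_column to its full row."""
    return {row[source_column]: row for row in rows if row[source_column]}


def translate_id(rows, from_column, from_id, to_column):
    """Return the person's id in to_column, or None if the person is
    unknown or no id has been recorded for that site."""
    row = build_index(rows, from_column).get(from_id)
    if row is None:
        return None
    # Not every person has an id for every site, so empty cells become None.
    return row[to_column] or None


rows = list(csv.DictReader(io.StringIO(SAMPLE_PEOPLE_CSV)))
print(translate_id(rows, "key_cricinfo", "1002", "key_cricketarchive"))  # 2002
print(translate_id(rows, "key_cricinfo", "1001", "key_cricketarchive"))  # None
```

To run this against the real data, replace the `io.StringIO` sample with an open file handle for people.csv. Empty cells are treated as "no id recorded", which matches the note that not all ids are provided for all people.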
names.csv provides a list of alternate or variant names for people, where they differ from the name given for the person in people.csv. These names generally include the variations used on the different sources, although some are due to name changes (for a variety of reasons). These names are not unique, so the same name may appear multiple times.
Each row of the file, other than the header, consists of the following fields:
- The Cricsheet identifier for the person
- The variant name for the person
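Because variant names are not unique, a lookup by name is best treated as returning a list of candidates rather than a single person. The following is a minimal sketch of that idea. The header names `identifier` and `name`, and the sample rows, are assumptions for illustration. Case-insensitive matching is a choice made for this example, not something the Register specifies.

```python
import csv
import io
from collections import defaultdict

# Made-up sample; the header names are assumed, so check the real names.csv.
SAMPLE_NAMES_CSV = """identifier,name
a1b2c3d4,John Smith
e5f6a7b8,John Smith
e5f6a7b8,Jon Smith
"""


def index_by_name(rows):
    """Group Cricsheet identifiers by variant name, ignoring case."""
    index = defaultdict(set)
    for row in rows:
        index[row["name"].casefold()].add(row["identifier"])
    return index


def candidates(index, name):
    """Return a sorted list of every identifier recorded against the name."""
    return sorted(index.get(name.casefold(), set()))


name_rows = list(csv.DictReader(io.StringIO(SAMPLE_NAMES_CSV)))
name_index = index_by_name(name_rows)
print(candidates(name_index, "john smith"))  # ['a1b2c3d4', 'e5f6a7b8']
print(candidates(name_index, "Jon Smith"))   # ['e5f6a7b8']
```

When a name returns more than one candidate, some other piece of information (such as an identifier from one of the source sites) is needed to decide which person is meant.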
How often are the files updated?
The data is updated whenever either a new person is added to the Cricsheet match data, or a new identifier is added for an existing person. The files are then uploaded to this site when it is rebuilt.
What if I spot an issue/error?
If you spot an issue or error in the Register data you should get in touch by following the instructions on the Contact page.
This dataset is made available under the Open Data Commons Attribution License: http://opendatacommons.org/licenses/by/1.0/. If you're going to use the data you should read the license. What follows is only a human-readable summary of some of its key terms.
You are free:
- To share: To copy, distribute and use the dataset.
- To create: To produce works from the dataset.
- To adapt: To modify, transform and build upon the dataset.
As long as you:
- Attribute: You must attribute any public use of the dataset, or works produced from the dataset, in the manner specified in the license. For any use or redistribution of the dataset, or works produced from it, you must make clear to others the license of the dataset and keep intact any notices on the original dataset.