SOCcer (Standardized Occupation Coding for Computer-assisted Epidemiological Research) is a publicly available application developed to help epidemiological researchers incorporate occupational risk into their studies. The application is not intended to replace expert coders; rather, it prioritizes the job descriptions that would most benefit from expert review. Low-scoring job descriptions are more likely to require expert review than high-scoring job descriptions. The coding is performed using an ensemble classifier, which combines the results of multiple classifiers to produce a single classifier that performs better than any single classifier in the ensemble.
If you publish results that use SOCcer, please reference: Russ DE, Ho K-Y, Colt JS, Armenti KR, Baris D, Wong-Ho C, Davis F, Johnson A, Purdue MP, Karagas MR, Schwarz K, Schwenn M, Silverman DT, Stewart PA, Johnson CA, Friesen MC, Computer-based coding of free-text job descriptions to efficiently identify occupations in epidemiological studies. Occup Environ Med 2016;73(6):417-24.
Run SOCcer
To begin, select a model version and upload a CSV input file that conforms to the input data format.
SOCcer can use different models, depending on the type of data that you have. The table below describes the models available.
|model v2.0||SOCcer v2.0 was developed using an expanded training data set comprising job descriptions from epidemiologic studies. V2’s scoring algorithm was revised to account for deviations from linearity in the maximum entropy classifiers for job title and task and interactions between classifiers. SOCcer’s v1 vs. v2 agreement with the expert for 11,943 jobs from a population-based case-control study improved from 44.5% to 50.2% at the 6-digit level and from 51.6% to 56.3% at the 5-digit level. Ties increased from 1.1% in v1 to 7.1% in v2; however, 97% of v2 ties were at scores <0.1. V2 had a stronger, less attenuated linear relationship between expert agreement and SOCcer score than v1, with v2 scores of 0.50 and 0.75 predicting 51% and 70% agreement with experts, respectively. A presentation about SOCcer 2.0 and its performance in comparison to SOCcer 1.0 from the X2018 International Conference on the Science of Exposure Assessment held in Manchester, UK, on September 24-26, 2018 is available at the following link: SOCcer 2.0: An improved computer algorithm to code free-text job descriptions to standardized occupation classification codes.|
|model v1.0||This model codes job descriptions to the SOC 2010 classification system as described in Computer-Based Coding of Free-Text Job Descriptions to Efficiently and Reliably Incorporate Occupational Risk Factors into Large-Scale Epidemiological Studies. This model uses the variables JobTitle, SIC, and JobTask. This model can be used even if parts of the Job Description are not available (e.g. SIC or JobTask is missing). The classifiers will assign a '0' for the missing information in the calculation of the overall SOCcer score.|
SOCAssign is an application that assists expert review of the top 10 SOCcer assignments for each job description so that an expert SOC-2010 assignment can be made. SOCAssign reads SOCcer output. Before importing, the SOCcer results can be preprocessed to focus the expert review on a subset of job descriptions, such as those with SOC codes tied for the highest score or those with low SOCcer scores. For each job description, SOCAssign allows the selection of up to 3 SOC-2010 codes. The codes can be selected from the SOCcer output list, selected from a list of all SOC-2010 codes, or entered manually. A validation check ensures that only valid SOC-2010 codes can be entered.
|SOCAssign||Run as Web Start|
|Download||Run as Java Application|
To run SOCAssign as a Java application, double-click the downloaded SOCAssign.jar file (make sure the java executable is on your path).
Currently, the input for SOCcer is a comma-separated file with three columns: job title, SIC, and job tasks. SOCcer strictly enforces the format of the input file. The input file must contain the header line (the case must match also):
After the header line there is a separate row for each job description. There MUST be three comma-separated values on each line. If the job title or task contains a comma, the value must be in quotes or else there will be an error. An example of a valid job description is:
"Teacher, high school", 8211, "formulate lesson plans, teach 11th grade match"
Leaving out the quotation marks makes the line appear to have five values, which causes an error. SOCcer will list all line numbers with errors and require you to fix the input before proceeding. A value may be blank (missing information), but its position must still be included. Valid examples with missing information are:
"Teacher, high school", , "formulate lesson plans, teach 11th grade match"
,8211, "formulate lesson plans, teach 11th grade match"
"Teacher, high school",,
The SOCcer results are provided in a comma-separated file that contains the row number, job title, SIC, job tasks, and the ten highest ranked SOC codes, with corresponding SOCcer scores.
|1||"Teacher, high school"||"8211"||"formulate lesson plans, teach 11th grade match"||25-2031||0.979||25-2032||0.357||...||25-1194||0.042|
|2||"Java Developer"||"7371"||"Develop use cases, write computer software in java"||15-1131||0.959||15-1132||0.717||...||11-2031||0.034|
|Note: SOC3 through SOC9 were omitted and scores were rounded for display purposes.|
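A results file in this layout can be screened to decide which jobs to send to expert review. The sketch below applies the thresholds suggested in the Frequently Asked Questions (top score below 0.3, or the two highest scores within 0.1 of each other). It assumes each parsed row holds the row number, job title, SIC, and job tasks followed by alternating SOC code and score values; the function names and that column layout are assumptions for illustration, not part of SOCcer.

```python
def scores_from_row(row):
    """Extract the scores from one parsed results row.

    Assumed layout: [row number, job title, SIC, job tasks,
                     code 1, score 1, code 2, score 2, ...].
    """
    return [float(value) for value in row[5::2]]


def needs_review(scores, min_score=0.3, max_gap=0.1):
    """Flag a job when its top score is low or its top two scores are close."""
    ranked = sorted(scores, reverse=True)
    if not ranked or ranked[0] < min_score:
        return True
    # Two leading codes with similar scores are hard to separate automatically.
    return len(ranked) > 1 and ranked[0] - ranked[1] <= max_gap


row = ["1", "Teacher, high school", "8211", "formulate lesson plans",
       "25-2031", "0.979", "25-2032", "0.357"]
print(needs_review(scores_from_row(row)))  # False: high top score, clear gap
print(needs_review([0.25, 0.10]))          # True: hypothetical low top score
print(needs_review([0.62, 0.58]))          # True: hypothetical near tie
```

The defaults are parameters so that a study needing only 3-digit codes can choose looser thresholds.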
SOCcer 2.0 requires data in comma-separated value (CSV) format. The data format requires four variables (ID, job title, SIC, and job task), but only ID and job title are necessary to obtain a SOCcer score. Job tasks help improve the performance of the classifier. SIC is no longer part of the classifier, because the improvements it made to the results were small; however, at this time the data format still requires a valid SIC code as a placeholder (e.g., 9999 if no SIC). For task information, use commas to specify blank data.
An example of a valid file is:
myid1,"Teacher, high school",8211, "formulate lesson plans, teach 11th grade match"
myid2,, 8211,"formulate lesson plans, teach 11th grade match"
myid3,"Teacher, high school", 8211,
myid4,"high school teacher",9999,
If the ids are left blank, they will be assigned by SOCcer as 1, 2, 3, … , <number of lines>.
If you are using Microsoft Excel, watch for newline characters and smart quotes in your data. Both will cause problems with the data upload procedure.
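One way to avoid these problems is to write the upload file programmatically. The sketch below is an illustration under stated assumptions, not an official SOCcer tool: the names `clean` and `to_soccer_v2_csv` are hypothetical, no header line is written because the example file above shows none, and typographic quotes are replaced with a plain apostrophe so that no value contains a double quote needing escape.

```python
import csv
import io

# Map curly single and double quotes to a plain apostrophe (a choice of this sketch).
SMART_QUOTES = str.maketrans({"\u201c": "'", "\u201d": "'",
                              "\u2018": "'", "\u2019": "'"})


def clean(value):
    """Replace smart quotes and collapse newlines and repeated whitespace."""
    return " ".join(value.translate(SMART_QUOTES).split())


def to_soccer_v2_csv(jobs, sic_placeholder="9999"):
    """Build CSV text from (id, job title, SIC, job task) tuples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for job_id, title, sic, task in jobs:
        # A blank SIC is replaced by the placeholder the format still requires.
        writer.writerow([clean(job_id), clean(title),
                         clean(sic) or sic_placeholder, clean(task)])
    return buffer.getvalue()


jobs = [
    ("myid1", "Teacher, high school", "8211",
     "formulate lesson plans,\nteach 11th grade"),
    ("myid4", "\u201chigh school teacher\u201d", "", ""),
]
print(to_soccer_v2_csv(jobs), end="")
```

The `csv` writer adds quotation marks only around values that contain a comma, so the embedded newline and the smart quotes are gone before the file is uploaded.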
If you have any questions, comments, or concerns, send us an e-mail at NCISOCcerWebAdmin@mail.nih.gov.
- Russ DE, Ho K-Y, Colt JS, Armenti KR, Baris D, Wong-Ho C, Davis F, Johnson A, Purdue MP, Karagas MR, Schwarz K, Schwenn M, Silverman DT, Stewart PA, Johnson CA, Friesen MC, "Computer-based coding of free-text job descriptions to efficiently and reliably incorporate occupational risk factors into large-scale epidemiologic studies", Occup Environ Med 2016;73(6):417-24.
- Russ DE, Ho K-Y, Johnson CA, Friesen MC, "Computer-Based Coding of Occupation Codes for Epidemiological Analyses", Proc IEEE Int Symp Comput Based Med Syst 2014, 2014, pp. 347-350.
Frequently Asked Questions
- Which jobs require expert review?
- SOCcer does not replace manual expert review for many jobs. The proportion requiring review is largely dependent on the quality of the job description data in your study. It also depends on the level of detail needed. Accuracy is higher for 3-digit SOC-2010 codes than for 6-digit SOC codes. For an accurate 6-digit score, we recommend that jobs with scores less than 0.3 be reviewed by an expert. We also recommend reviewing jobs whose two highest scoring SOC codes have similar scores (within 0.1 score units). Our companion software SOCAssign can be used to assist with the manual review.
- What is the “gold standard” for assignment?
- Best practice is to use two experts to code each job and resolve discrepancies in their coding. However, there is no gold standard. This is because the job descriptions often do not provide sufficient information to pick a single code; multiple, equally plausible codes may exist for a given job description.
- Which version of SOCcer do I use?
- For all new projects, you will have more accurate results with SOCcer v2.0. SOCcer 1.0 remains available for comparison purposes and for users who wish to recreate previous assessments that used v. 1.0.
- What can I do to improve the results of SOCcer?
- To improve SOCcer's performance, we are building a database of job descriptions linked to SOC-2010 (and other classification systems) that can be used to build and refine classifiers. If you have job descriptions that have been coded by an expert coder (or initially coded by SOCcer, then reviewed by an expert) and you are willing to provide them to us, we will be happy to include those job descriptions in our knowledge base for use in future versions of SOCcer. If the data are protected, a data use agreement may be possible. Your institute may provide guidance on data use agreements.
- What about HIPAA concerns?
- The data input file does not accept identifiers, which helps prevent you from uploading PII; however, we do not screen your data for PII. Please check your input file for PII before you upload your data onto our server.
- What are SOCcer scores?
- Our classifier uses logistic regression to calculate the log-odds that an expert reviewer would have selected a SOC 2010 code. The SOCcer score is the log-odds transformed to a probability. In general, the higher the SOCcer score, the greater the probability of matching an expert review. Please see our paper for more details on the relationship between SOCcer score and probability of matching an expert coder's SOC assignment. Some datasets are more difficult to classify than others, depending on the quality of the data. The SOCcer score distribution provides an overview of how well SOCcer performed on your data set.
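The last answer describes the score as log-odds transformed to a probability. The standard logistic function performs that transformation, and the sketch below shows it in generic form. It is not SOCcer's scoring code, and the function name `log_odds_to_probability` is hypothetical.

```python
import math


def log_odds_to_probability(log_odds):
    """Standard logistic transform: p = 1 / (1 + exp(-log_odds))."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    # Equivalent form that avoids overflow for large negative log-odds.
    e = math.exp(log_odds)
    return e / (1.0 + e)


print(log_odds_to_probability(0.0))  # 0.5: even odds
```

Log-odds of 0 correspond to even odds, so the result is 0.5; positive log-odds give values above 0.5 and negative log-odds give values below it.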