
Standardized Occupation Coding for Computer-assisted Epidemiological Research

Welcome to SOCcer

SOCcer (Standardized Occupation Coding for Computer-assisted Epidemiological Research) is a publicly available application that was developed to help epidemiological researchers incorporate occupational risk into their studies. The application is not intended to replace expert coders, but rather to prioritize the job descriptions that would benefit most from expert review. Low-scoring job descriptions are more likely to require expert review than high-scoring job descriptions. The coding is performed using an ensemble classifier, which combines the results of multiple classifiers to produce a single classifier that performs better than any single classifier in the ensemble.

If you publish results that use SOCcer, please reference: Daniel E. Russ, Pabitra Josse, Thomas Remen, Jonathan N. Hofmann, Mark P. Purdue, Jack Siemiatycki, Debra T. Silverman, Yawei Zhang, Jerome Lavoué, Melissa C. Friesen, "Evaluation of the updated SOCcer v2 algorithm for coding free-text job descriptions in three epidemiologic studies", Annals of Work Exposures and Health, Epub 18 April 2023. DOI: 10.1093/annweh/wxad020.

SOCcer

If this job is submitted to the queue, a notification will be sent to your email address once processing is complete.

Models


SOCcer v2.0 was released in February 2019. At an undetermined time between its release and March 2023, it was inadvertently replaced with an older, developmental model (v1.9) that was incorrectly listed as v2.0 on the web portal. Both v1.9 and v2.0 are now correctly available online. If you used “v2” in that time frame, you can determine whether you actually used v1.9 or v2.0 by re-running your data through both versions of SOCcer and seeing which output matches the output you originally received. The proportion of ties between the two top-scoring codes gives a quick indication: if fewer than 1% of jobs are tied, you likely used v1.9; if more than 1% are tied, you likely used v2.0. See below for a description of the model and performance differences between v1, v1.9, and v2. We recommend using v2.0 in all new coding efforts.
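The tie check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the results file has columns named Score_1 and Score_2 (as in the output layout shown on the Help tab), treats a tie as exact equality of the two reported scores, and uses a made-up two-row sample rather than real SOCcer output. The function name tie_fraction is hypothetical.

```python
import csv
import io


def tie_fraction(rows):
    """Return the fraction of rows whose two top-ranked codes have equal scores."""
    rows = list(rows)
    if not rows:
        return 0.0
    ties = sum(1 for r in rows if float(r["Score_1"]) == float(r["Score_2"]))
    return ties / len(rows)


# Made-up results, used only to exercise the function (not real SOCcer output).
sample = io.StringIO(
    "Id,SOC2010_1,Score_1,SOC2010_2,Score_2\n"
    "1,25-2031,0.979,25-2032,0.357\n"
    "2,15-1131,0.050,15-1132,0.050\n"
)
fraction = tie_fraction(csv.DictReader(sample))
print(f"{fraction:.1%} of jobs have tied top scores")
```

To apply it to a real results file, pass csv.DictReader(open(path, newline="")) instead of the in-memory sample, and compare the resulting fraction against the 1% guideline.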

SOCcer can use different models, depending on the type of data that you have. The table below describes the models available.

Model Description
model v2.0 SOCcer v2.0 was developed using an expanded training data set comprising job descriptions from epidemiologic studies. V2’s scoring algorithm was revised to account for deviations from linearity in the maximum entropy classifiers for job title and task and for interactions between classifiers. SOCcer’s v1 vs. v2 agreement with the expert for 11,943 jobs from a population-based case-control study improved from 44.5% to 50.2% at the 6-digit level and from 51.6% to 56.3% at the 5-digit level. Ties increased from 1.1% in v1 to 7.1% in v2; however, 97% of v2 ties were at scores <0.1. V2 had a stronger, less attenuated linear relationship between expert agreement and SOCcer score than v1, with v2 scores of 0.50 and 0.75 predicting 51% and 70% agreement with experts, respectively. An evaluation of the performance of SOCcer v2.0 is described in Evaluation of the updated SOCcer v2 algorithm for coding free-text job descriptions in three epidemiologic studies, Annals of Work Exposures and Health, Epub 18 April 2023. DOI: 10.1093/annweh/wxad020.
model v1.9 We do not recommend using this model for new coding efforts. It remains available so that previous users of this model can replicate their results or use it within ongoing studies for consistency. V1.9 improved upon v1.0 by incorporating job descriptions from epidemiologic studies into the training data set. However, v1.9 does not include the refinements to the algorithm that were incorporated into v2.0 to better predict agreement with expert coding. As a result, its scores are generally overly optimistic. See Figure 1 below.
model v1.0 This model codes job descriptions to the SOC 2010 classification system as described in Computer-Based Coding of Free-Text Job Descriptions to Efficiently and Reliably Incorporate Occupational Risk Factors into Large-Scale Epidemiological Studies. This model uses the variables JobTitle, SIC, and JobTask. This model can be used even if parts of the Job Description are not available (e.g. SIC or JobTask is missing). The classifiers will assign a '0' for the missing information in the calculation of the overall SOCcer score.
Study/model          Percent agreement by SOC hierarchical level    % ties   Median SOCcer score (IQR)
                     2-digit   3-digit   5-digit   6-digit
US Renal/model 1.0   76        64        52        45               1.1      0.46 (0.24-0.77)
US Renal/model 2.0   73        63        56        50               7.4      0.41 (0.17-0.71)
Table 1: performance of SOCcer model v1 and v2 from Evaluation of the updated SOCcer v2 algorithm for coding free-text job descriptions in three epidemiologic studies, Annals of Work Exposures and Health, in press, 2023
SOCcer Model Performance
Figure 1: Agreement between expert-assigned SOC code and highest scoring SOC code for jobs in the US Renal study (n=11,943) by score obtained from SOCcer models v1.0, v1.9, and 2.0

SOCAssign


SOCAssign is an application that assists expert review of the top 10 SOCcer assignments for each job description in order to provide an expert SOC-2010 assignment. SOCAssign reads SOCcer output. Before importing, the SOCcer results can be preprocessed to focus the expert review on a subset of job descriptions, such as those with SOC codes that are tied for the highest score or those with low SOCcer scores. For each job description, SOCAssign allows the selection of up to 3 SOC-2010 codes. The codes can be selected from the SOCcer output list, selected from a list of all SOC-2010 codes, or entered manually. A validation check ensures that only valid SOC-2010 codes can be entered.

SOCAssign Run as Web Start
Download Run as Java Application

To run SOCAssign as a Java application, double click on the downloaded SOCAssign.jar file (make sure the java executable file is in the path).

Help

We recommend use of the most recent version, SOCcer 2.0, which has improved accuracy. SOCcer 1.0 remains enabled for those who used it prior to the update.

Input Format

Currently, the input for SOCcer is a comma-separated file with three columns: job title, SIC, and job tasks. SOCcer strictly enforces the format of the input file. The input file must contain the header line (the case must match also):

JobTitle,SIC,JobTask

After the header line there is a separate row for each job description. There MUST be three comma-separated values on each line. If the job title or task contains a comma, the value must be in quotes or else there will be an error. An example of a valid job description is:

"Teacher, high school", 8211, "formulate lesson plans, teach 11th grade match"

Leaving out the quotation marks would cause the line to appear to have five values and would return an error. SOCcer will list all line numbers with errors and require you to fix the input before proceeding. A value may be blank (missing information), but its position must still be marked by the separating commas. Valid examples with missing information are:

"Teacher, high school", , "formulate lesson plans, teach 11th grade match"
,8211, "formulate lesson plans, teach 11th grade match"
"Teacher, high school",,
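Before uploading, the three-field rule above can be checked locally. The following is a minimal sketch, not SOCcer's own validator: the function name find_bad_lines is hypothetical, and it assumes Python's csv module with skipinitialspace=True is an acceptable stand-in for how the spaces after commas in the examples are treated. It reports 1-based line numbers, including line 1 if the header does not match exactly.

```python
import csv
import io

EXPECTED_HEADER = ["JobTitle", "SIC", "JobTask"]


def find_bad_lines(text):
    """Return 1-based line numbers that break the three-column v1.0 layout."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader, None)
    # The header must match exactly, including case.
    bad = [] if header == EXPECTED_HEADER else [1]
    for row in reader:
        # Every data line must split into exactly three values.
        if len(row) != 3:
            bad.append(reader.line_num)
    return bad


quoted = (
    "JobTitle,SIC,JobTask\n"
    '"Teacher, high school", 8211, "formulate lesson plans, teach 11th grade match"\n'
    '"Teacher, high school",,\n'
)
unquoted = (
    "JobTitle,SIC,JobTask\n"
    "Teacher, high school, 8211, formulate lesson plans, teach 11th grade match\n"
)
print(find_bad_lines(quoted))    # no problems found
print(find_bad_lines(unquoted))  # the unquoted line splits into five values
```

The unquoted variant is flagged because its embedded commas split the line into five values, which is the failure mode described above.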

The SOCcer results are provided in a comma-separated file that contains the row number, job title, SIC, job tasks, and the ten highest ranked SOC codes, with corresponding SOCcer scores.

Id JobTitle SIC JobTask SOC2010_1 Score_1 SOC2010_2 Score_2 ... SOC2010_10 Score_10
1 "Teacher, high school" "8211" "formulate lesson plans, teach 11th grade match" 25-2031 0.979 25-2032 0.357 ... 25-1194 0.042
2 "Java Developer" "7371" "Develop use cases, write computer software in java" 15-1131 0.959 15-1132 0.717 ... 11-2031 0.034
Note: SOC3 through SOC9 were omitted and scores were rounded for display purposes.
We recommend use of the most recent version, SOCcer 2.0, which has improved accuracy. See the warning under the Models tab.

Input Format

SOCcer 1.9 requires data in comma-separated value (CSV) format. The format requires four variables (id, job title, SIC, and job task), but only id and job title are necessary to obtain a SOCcer score. Job tasks help improve the performance of the classifier. SIC is no longer part of the classifier because it improved the results only slightly; however, at this time the data format still requires a valid SIC code as a placeholder (e.g., 9999 if no SIC). If task information is missing, keep the comma and leave the field blank.

An example of a valid file is:

id,jobtitle,sic,jobtask
myid1,"Teacher, high school",8211, "formulate lesson plans, teach 11th grade match"
myid2,, 8211,"formulate lesson plans, teach 11th grade match"
myid3,"Teacher, high school", 8211,
myid4,"high school teacher",9999,

If the ids are left blank, they will be assigned by SOCcer as 1, 2, 3, … , <number of lines>.

If you are using Microsoft Excel, be careful about having newline characters and smart quotes in your data. They are going to cause problems with the data upload procedure.

References

  1. Russ DE, Ho K-Y, Colt JS, Armenti KR, Baris D, Wong-Ho C, Davis F, Johnson A, Purdue MP, Karagas MR, Schwarz K, Schwenn M, Silverman DT, Stewart PA, Johnson CA, Friesen MC, "Computer-based coding of free-text job descriptions to efficiently and reliably incorporate occupational risk factors into large-scale epidemiologic studies", Occup Environ Med 2016;73(6):417-24. DOI: 10.1136/oemed-2015-103152 [Pubmed Central]
  2. Russ DE, Ho K-Y, Johnson CA, Friesen MC, "Computer-Based Coding of Occupation Codes for Epidemiological Analyses", Proc IEEE Int Symp Comput Based Med Syst 2014, pp. 347-350. DOI: 10.1109/CBMS.2014.79 [Pubmed Central]
  3. Daniel E. Russ, Pabitra Josse, Thomas Remen, Jonathan N. Hofmann, Mark P. Purdue, Jack Siemiatycki, Debra T. Silverman, Yawei Zhang, Jerome Lavoué, Melissa C. Friesen, "Evaluation of the updated SOCcer v2 algorithm for coding free-text job descriptions in three epidemiologic studies", Annals of Work Exposures and Health, 2023, Epub 18 April 2023. DOI: 10.1093/annweh/wxad020

Input Format

SOCcer 2.0 requires data in comma-separated value (CSV) format. The format requires four variables (id, job title, SIC, and job task), but only id and job title are necessary to obtain a SOCcer score. Job tasks help improve the performance of the classifier. SIC is no longer part of the classifier because it improved the results only slightly; however, at this time the data format still requires a valid SIC code as a placeholder (e.g., 9999 if no SIC). If task information is missing, keep the comma and leave the field blank.

An example of a valid file is:

id,jobtitle,sic,jobtask
myid1,"Teacher, high school",8211, "formulate lesson plans, teach 11th grade match"
myid2,, 8211,"formulate lesson plans, teach 11th grade match"
myid3,"Teacher, high school", 8211,
myid4,"high school teacher",9999,

If the ids are left blank, they will be assigned by SOCcer as 1, 2, 3, … , <number of lines>.

If you are using Microsoft Excel, be careful about having newline characters and smart quotes in your data. They are going to cause problems with the data upload procedure.
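One way to avoid the newline and smart-quote problems is to generate the upload file with a script rather than exporting it by hand. The sketch below is illustrative only: the helper names clean and write_soccer_input are hypothetical, the 9999 placeholder follows the example given above, and the cleaning rules (replace typographic quotes with ASCII ones, collapse embedded newlines to single spaces) are assumptions about what the upload procedure tolerates rather than documented SOCcer behavior.

```python
import csv
import io

# Map typographic quotes to their plain ASCII equivalents.
SMART_QUOTES = str.maketrans(
    {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
)


def clean(value):
    """Replace smart quotes and collapse embedded newlines and repeated spaces."""
    return " ".join(value.translate(SMART_QUOTES).split())


def write_soccer_input(jobs, out, placeholder_sic="9999"):
    """Write (id, title, sic, task) tuples in the four-column layout."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "jobtitle", "sic", "jobtask"])
    for job_id, title, sic, task in jobs:
        # The csv writer adds quotes automatically around values containing commas.
        writer.writerow([job_id, clean(title), sic or placeholder_sic, clean(task)])


buffer = io.StringIO()
write_soccer_input(
    [
        ("myid1", "Teacher, high school", "8211",
         "formulate lesson plans,\nteach 11th grade"),
        ("myid4", "driver\u2019s helper", "", ""),
    ],
    buffer,
)
print(buffer.getvalue())
```

Writing to a real file works the same way: open it with open(path, "w", newline="", encoding="utf-8") and pass the file object in place of the buffer.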

If you have any questions, comments, or concerns, send us an e-mail at NCISOCcerWebAdmin@mail.nih.gov.


Frequently Asked Questions


Which jobs require expert review?
SOCcer does not replace manual expert review for many jobs. The proportion requiring review is largely dependent on the quality of the job description data in your study. It also depends on the level of detail needed. Accuracy is higher for 3-digit SOC-2010 codes than for 6-digit SOC-2010 codes. For accurate 6-digit coding, we recommend that jobs with scores less than 0.3 be reviewed by an expert. We also recommend reviewing jobs whose two highest scoring SOC codes have similar scores (within 0.1 score units). Our companion software SOCAssign can be used to assist with the manual review.
What is the “gold standard” for assignment?
Best practice is to use two experts to code each job and resolve discrepancies in their coding. However, there is no gold standard. This is because the job descriptions often do not provide sufficient information to pick a single code; multiple, equally plausible codes may exist for a given job description.
Which version of SOCcer do I use?
For all new projects, you will have more accurate results with SOCcer v2.0. SOCcer 1.0 remains available for comparison purposes and for users who wish to recreate previous assessments that used v. 1.0.
What can I do to improve the results of SOCcer?
To improve SOCcer's performance, we are building a database of job descriptions linked to SOC-2010 (and other classification systems) that can be used to build and refine classifiers. If you have job descriptions that have been coded by an expert coder (or initially coded by SOCcer, then reviewed by an expert) and you are willing to provide them to us, we will be happy to include those job descriptions in our knowledge base for use in future versions of SOCcer. If the data are protected, a data use agreement may be possible. Your institute may provide guidance on data use agreements.
What about HIPAA concerns?
The data input file does not accept identifiers, in order to help prevent you from uploading PII; however, we do not screen your data for PII. Please check your input file for PII before you upload your data onto our server.
What are SOCcer scores?
Our classifier uses logistic regression to calculate the log-odds that an expert reviewer would have selected a SOC 2010 code. The SOCcer score is the log-odds transformed to a probability. In general, the higher the SOCcer score, the greater the probability of matching an expert review. Please see our paper for more details on the relationship between SOCcer score and probability of matching an expert coder's SOC assignment. Some data sets are more difficult to classify than others, depending on the quality of the data. The SOCcer score distribution provides an overview of how well SOCcer performed on your data set.
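The review guidance in the first answer above (top score below 0.3, or the two highest scores within 0.1 of each other) can be applied to SOCcer output with a short script. This is a sketch under assumptions: the function name needs_review and its parameter names are hypothetical, "within 0.1" is read as a difference of at most 0.1, and the sample scores are taken from the illustrative output rows on the Help tab plus two made-up rows.

```python
def needs_review(score_1, score_2, low_score=0.3, close_margin=0.1):
    """Flag a job when its top score is low or its top two scores are close."""
    is_low = score_1 < low_score
    is_close = (score_1 - score_2) <= close_margin
    return is_low or is_close


# (top score, second score) pairs; the last two are made up for illustration.
examples = [(0.979, 0.357), (0.959, 0.717), (0.25, 0.05), (0.62, 0.58)]
for top, second in examples:
    label = "review" if needs_review(top, second) else "accept"
    print(f"{top:.3f} vs {second:.3f}: {label}")
```

The thresholds are exposed as parameters because the appropriate cutoffs depend on the level of SOC detail a study needs and on the quality of its job descriptions.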