
Standardized Occupation Coding for Computer-assisted Epidemiological Research

Welcome to SOCcer

SOCcer (Standardized Occupation Coding for Computer-assisted Epidemiological Research) is a publicly available application that was developed to help epidemiological researchers incorporate occupational and industrial risk into their studies. The application is not intended to replace expert coders, but rather prioritizes the job descriptions that would benefit most from expert review. Low-scoring job descriptions are more likely to require expert review than high-scoring job descriptions.

There are several tools in the SOCcer ecosystem:

SOCcerNET: The newest version of our occupational coding tool. Includes the option to crosswalk from US SOC 1980, Canadian NOC 2011/2016, and ISCO 1988.
CLIPS: Industry coding tool for NAICS 2022.
SOCAssign: Tool to aid expert coding of occupations to US SOC-2010.
Previous versions of SOCcer.

Both SOCcerNET and CLIPS can be incorporated into online questionnaires to help participants self-code their occupation and industry. Please contact Dr. Friesen (friesenmc@mail.nih.gov) and Dr. Russ (druss@mail.nih.gov) for more information on how to incorporate these tools into your studies.

If you publish results that use SOCcer, please reference: Daniel E. Russ, Pabitra Josse, Thomas Remen, Jonathan N. Hofmann, Mark P. Purdue, Jack Siemiatycki, Debra T. Silverman, Yawei Zhang, Jerome Lavoué, Melissa C. Friesen, "Evaluation of the updated SOCcer v2 algorithm for coding free-text job descriptions in three epidemiologic studies", Annals of Work Exposures and Health, Epub 18 April 2023. DOI: 10.1093/annweh/wxad020.


Tools


SOCcer can use different models, depending on the type of data that you have. The table below describes the models available. We recommend using SOCcerNET in all new coding efforts.

SOCcer Description
SOCcerNET SOCcerNET is a comprehensive redesign of SOCcer. This new version leverages a transformer-based neural network architecture to generate numeric embeddings from the combined job title and job task of a job description. In addition to the job title and job task, SOCcerNET can utilize an optional expert-assigned prior occupation code (U.S. SOC1980, Canadian NOC-2011/2016, and ISCO-1988) as an additional feature that is combined with the embeddings to classify the job description. The embeddings and the output of the crosswalk are fed into a dense neural network classification layer, which returns a score for each SOC code. Overall agreement with the expert for 11,943 jobs from a population-based case-control study improved from 50% to 56% at the 6-digit level and from 56% to 63% at the 5-digit level, compared to SOCcer v2. Agreement improved with increasing score (Figure 1).
CLIPS CLIPS uses small language models to code free-text industry information to NAICS 2022 codes. It uses a two-step process that first converts (embeds) the text information to numbers and then classifies using a dense classification neural network. CLIPS provides a score for each of the 689 NAICS 5-digit codes. We validated CLIPS using 1,586 jobs coded to NAICS 2022 by an expert; these jobs were selected using a random stratified selection process that covered the range of industries. Overall, the highest-scoring CLIPS industry code agreed with the expert-assigned code for 47% of the jobs. Agreement improved with increasing score (Figure 2). In addition, the expert-assigned code was in the top 3 CLIPS-suggested codes for 66% of the jobs, in the top 6 for 75%, and in the top 10 for 80%. This tool has primarily been developed to assist participant self-coding of industry when completing online questionnaires, and its utility for post-interview coding has undergone only limited evaluation.
model v2.0 SOCcer v2.0 was developed using an expanded training data set comprising job descriptions from epidemiologic studies. V2’s scoring algorithm was revised to account for deviations from linearity in the maximum entropy classifiers for job title and task and for interactions between classifiers. Agreement with the expert for 11,943 jobs from a population-based case-control study improved from 44.5% in v1 to 50.2% in v2 at the 6-digit level and from 51.6% to 56.3% at the 5-digit level. Ties increased from 1.1% in v1 to 7.1% in v2; however, 97% of v2 ties were at scores <0.1. V2 had a stronger, less attenuated linear relationship between expert agreement and SOCcer score than v1, with v2 scores of 0.50 and 0.75 predicting 51% and 70% agreement with experts, respectively. An evaluation of the performance of SOCcer v2.0 is described in "Evaluation of the updated SOCcer v2 algorithm for coding free-text job descriptions in three epidemiologic studies", Annals of Work Exposures and Health, Epub 18 April 2023. DOI: 10.1093/annweh/wxad020.
model v1.9 We do not recommend using this model for new coding efforts. It remains available so that previous users of this model can replicate their results or use it within ongoing studies for consistency. V1.9 improved upon v1.0 by incorporating job descriptions from epidemiologic studies into the training data set. However, v1.9 does not include the refinements to the algorithm that were incorporated into v2.0 to better predict agreement with expert coding. As a result, its scores are generally overly optimistic.
model v1.0 This model codes job descriptions to the SOC 2010 classification system as described in Computer-Based Coding of Free-Text Job Descriptions to Efficiently and Reliably Incorporate Occupational Risk Factors into Large-Scale Epidemiological Studies. This model uses the variables JobTitle, SIC, and JobTask. This model can be used even if parts of the Job Description are not available (e.g. SIC or JobTask is missing). The classifiers will assign a '0' for the missing information in the calculation of the overall SOCcer score.
Percent agreement with the expert by SOC hierarchical level, percent ties, and median SOCcer score (IQR):

Study/model                     2-digit (%)  3-digit (%)  5-digit (%)  6-digit (%)  Ties (%)  Median SOCcer score (IQR)
US Renal/model 1.0              76           64           52           45           1.1       0.46 (0.24-0.77)
US Renal/model 2.0              73           63           56           50           7.4       0.41 (0.17-0.71)
US Renal/SOCcerNET              80           71           63           56           0         0.58 (0.29-0.88)
US Renal/SOCcerNET w/SOC 1980   87           81           75           68           0         0.81 (0.40-0.96)
Table 1: performance of SOCcerNET, SOCcer 1.0 and 2.0
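The percent agreement columns in Table 1 compare the expert-assigned code with the highest-scoring code at different levels of the SOC hierarchy. The sketch below shows one way such a comparison could be computed. It assumes that agreement at the n-digit level means the first n digits of the two SOC-2010 codes are identical; the function names and the toy data are hypothetical and are not taken from SOCcer or from any study.

```python
def soc_prefix(code, digits):
    """Return the first `digits` digits of a SOC-2010 code such as '25-2031'."""
    return code.replace("-", "")[:digits]


def percent_agreement(expert_codes, top_codes, digits):
    """Percent of jobs whose expert code and top-scoring code match at a level.

    Assumption: two codes agree at the n-digit level when their first
    n digits are the same.
    """
    if len(expert_codes) != len(top_codes):
        raise ValueError("both lists must have one code per job")
    if not expert_codes:
        raise ValueError("at least one job is required")
    matches = sum(
        soc_prefix(expert, digits) == soc_prefix(top, digits)
        for expert, top in zip(expert_codes, top_codes)
    )
    return 100.0 * matches / len(expert_codes)


# Hypothetical toy data for illustration only.
expert = ["25-2031", "15-1131", "11-2021", "29-1141"]
top = ["25-2031", "15-1132", "11-9199", "31-1014"]

for digits in (2, 3, 5, 6):
    print(digits, percent_agreement(expert, top, digits))
```

With this toy data, agreement drops from 75% at the 2-digit level to 25% at the 6-digit level, which mirrors the general pattern in Table 1 that coarser levels are easier to match.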
SOCcer Model Performance
Figure 1: Agreement between expert-assigned SOC code and highest scoring SOC code for jobs in the US Renal study (n=11,943) by score obtained from SOCcer models v1.0, v2.0, and SOCcerNET
Clips Model Performance
Figure 2: Cumulative Agreement between expert-assigned NAICS 2022 5-digit code and highest scoring industry code for jobs in the US Renal study (n=11,943) by score obtained from CLIPS
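The CLIPS description reports how often the expert-assigned code appears among the top 3, top 6, and top 10 suggested codes. A minimal sketch of that kind of cumulative (top-k) agreement is shown below. The function name, the input layout (one list of suggested codes per job, ordered from highest to lowest score), and the example codes are assumptions made for illustration.

```python
def top_k_agreement(expert_codes, ranked_suggestions, k):
    """Percent of jobs whose expert code is among the k highest-scoring codes.

    ranked_suggestions holds, for each job, the suggested codes ordered
    from highest to lowest score (an assumed layout).
    """
    if len(expert_codes) != len(ranked_suggestions):
        raise ValueError("one list of suggestions is required per job")
    if not expert_codes:
        raise ValueError("at least one job is required")
    hits = sum(
        expert in ranked[:k]
        for expert, ranked in zip(expert_codes, ranked_suggestions)
    )
    return 100.0 * hits / len(expert_codes)


# Hypothetical 5-digit industry codes for illustration only.
expert = ["62121", "72251", "31141", "48511"]
ranked = [
    ["62121", "62131", "62139"],  # expert code ranked first
    ["72241", "72251", "72233"],  # expert code ranked second
    ["31142", "31199", "31141"],  # expert code ranked third
    ["48521", "48541", "48599"],  # expert code not suggested
]

for k in (1, 2, 3):
    print(k, top_k_agreement(expert, ranked, k))
```

Top-k agreement can only stay the same or increase as k grows, which is why the reported percentages rise from the top 3 to the top 10.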

SOCAssign


SOCAssign is an application that assists expert review of the top 10 SOCcer assignments for each job description in order to provide an expert SOC-2010 assignment. SOCAssign reads SOCcer output. Before importing, the SOCcer results can be preprocessed to focus the expert review on a subset of job descriptions, such as job descriptions with SOC codes that are tied for the highest score or that had low SOCcer scores. For each job description, SOCAssign allows the selection of up to 3 SOC-2010 codes. The codes can be selected from the SOCcer output list, selected from a list of all SOC-2010 codes, or entered manually. A validation check ensures that only valid SOC-2010 codes can be entered.
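The preprocessing step mentioned above can be sketched as a simple filter. The example below flags a job for expert review when its highest score is low or when its two highest scores are close together. The default thresholds (0.3 and 0.1) follow the recommendations in the FAQ on this page; the function name and the input layout are assumptions, and this is not the logic used inside SOCAssign.

```python
def needs_expert_review(scores, low_score=0.3, tie_margin=0.1):
    """Return True if a job's SOCcer scores suggest expert review.

    scores: the scores of the suggested SOC codes for one job (any order).
    low_score and tie_margin default to the values recommended in the FAQ.
    """
    if not scores:
        return True  # nothing was suggested, so an expert has to code the job
    ranked = sorted(scores, reverse=True)
    if ranked[0] < low_score:
        return True  # highest score is low
    if len(ranked) > 1 and ranked[0] - ranked[1] <= tie_margin:
        return True  # two highest scores are tied or nearly tied
    return False


print(needs_expert_review([0.979, 0.357, 0.042]))  # False
print(needs_expert_review([0.25, 0.05]))           # True (low score)
print(needs_expert_review([0.62, 0.58]))           # True (near tie)
```

Both thresholds are parameters so that a study can tighten or relax them depending on the level of detail it needs.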

SOCAssign Run as Web Start
Download Run as Java Application

To run SOCAssign as a Java application, double click on the downloaded SOCAssign.jar file (make sure the java executable file is in the path).

Help

We recommend using SOCcerNET, which has improved agreement with expert coders. SOCcer 1.0 remains enabled for those who used it prior to the update.

Input Format

Currently, the input for SOCcer is a comma-separated file with three columns: job title, SIC, and job tasks. SOCcer strictly enforces the format of the input file. The input file must contain the header line (the case must match also):

JobTitle,SIC,JobTask

After the header line there is a separate row for each job description. There MUST be three comma-separated values on each line. If the job title or task contains a comma, the value must be in quotes or else there will be an error. An example of a valid job description is:

"Teacher, high school", 8211, "formulate lesson plans, teach 11th grade match"

Leaving out the quotation marks would cause the line to appear to have five values and will return an error. SOCcer will list all line numbers with errors and require you to fix the input before proceeding. A value may be blank (missing information), but must be included. Valid examples with missing information are:

"Teacher, high school", , "formulate lesson plans, teach 11th grade match"
,8211, "formulate lesson plans, teach 11th grade match"
"Teacher, high school",,

The SOCcer results are provided in a comma-separated file that contains the row number, job title, SIC, job tasks, and the ten highest ranked SOC codes, with corresponding SOCcer scores.

Id JobTitle SIC JobTask SOC2010_1 Score_1 SOC2010_2 Score_2 ... SOC2010_10 Score_10
1 "Teacher, high school" "8211" "formulate lesson plans, teach 11th grade match" 25-2031 0.979 25-2032 0.357 ... 25-1194 0.042
2 "Java Developer" "7371" "Develop use cases, write computer software in java" 15-1131 0.959 15-1132 0.717 ... 11-2031 0.034
Note: SOC3 through SOC9 were omitted and scores were rounded for display purposes.
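The SOCcer 1.0 input rules described above (exact header, three values per line, quotes around values that contain commas) can be checked locally before uploading. The following is a minimal sketch using Python's standard csv module. It is not SOCcer's own validator, and the function name and error messages are hypothetical.

```python
import csv
import io

EXPECTED_HEADER = ["JobTitle", "SIC", "JobTask"]


def find_format_errors(text):
    """Return (line_number, message) pairs for lines that break the layout."""
    errors = []
    # skipinitialspace lets a quoted value follow ", " as in the examples above.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        errors.append((1, "header must be exactly JobTitle,SIC,JobTask"))
    for row in reader:
        if len(row) != 3:
            errors.append((reader.line_num, f"expected 3 values, found {len(row)}"))
    return errors


sample = (
    "JobTitle,SIC,JobTask\n"
    '"Teacher, high school", 8211, "formulate lesson plans, teach 11th grade match"\n'
    "Teacher, high school, 8211, formulate lesson plans, teach 11th grade match\n"
    '"Teacher, high school",,\n'
)
print(find_format_errors(sample))  # [(3, 'expected 3 values, found 5')]
```

The unquoted third line is reported because its commas split it into five values, while the last line passes because blank values are still present as empty fields.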
We recommend using SOCcerNET, which has improved agreement with expert coders. SOCcer 1.9 remains enabled for those who used it prior to the update.

Input Format

SOCcer 1.9 requires data in comma-separated value (CSV) format. The data format requires four variables (id, job title, SIC, and job task), but only the id and job title are necessary to obtain a SOCcer score. Job tasks help improve the performance of the classifier. SIC is no longer part of the classifier because the improvement to the results was small; however, at this time the data format still requires a valid SIC code as a placeholder (e.g., 9999 if no SIC). If task information is missing, keep the comma and leave the value blank.

An example of a valid file is:

id,jobtitle,sic,jobtask
myid1,"Teacher, high school",8211, "formulate lesson plans, teach 11th grade match"
myid2,, 8211,"formulate lesson plans, teach 11th grade match"
myid3,"Teacher, high school", 8211,
myid4,"high school teacher",9999,

If the ids are left blank, they will be assigned by SOCcer as 1, 2, 3, … , <number of lines>.

If you are using Microsoft Excel, be careful about having newline characters and smart quotes in your data. They are going to cause problems with the data upload procedure.
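One way to avoid the problems described above is to clean the text and write the file with a CSV library rather than by hand. The sketch below replaces smart quotes with straight quotes, collapses embedded newlines into spaces, and fills in the 9999 placeholder when no SIC is available. The function names and the tuple layout of the input are assumptions made for illustration.

```python
import csv
import io

# Map common smart quotes to their straight equivalents.
SMART_QUOTES = str.maketrans(
    {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
)


def clean_text(value):
    """Replace smart quotes and collapse newlines and repeated spaces."""
    return " ".join(value.translate(SMART_QUOTES).split())


def build_input(jobs, placeholder_sic="9999"):
    """Build CSV text from (id, job_title, sic, job_task) tuples."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "jobtitle", "sic", "jobtask"])
    for job_id, title, sic, task in jobs:
        # The csv module adds quotes around values that contain commas.
        writer.writerow(
            [job_id, clean_text(title), sic or placeholder_sic, clean_text(task)]
        )
    return out.getvalue()


jobs = [
    ("myid1", "Teacher,\nhigh school", "8211", "formulate lesson plans,\r\nteach 11th grade"),
    ("myid2", "driver\u2019s helper", "", ""),
]
print(build_input(jobs))
```

Letting the csv module handle quoting means that values containing commas are quoted automatically and consistently.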

References

  1. Russ DE, Ho K-Y, Colt JS, Armenti KR, Baris D, Wong-Ho C, Davis F, Johnson A, Purdue MP, Karagas MR, Schwarz K, Schwenn M, Silverman DT, Stewart PA, Johnson CA, Friesen MC, "Computer-based coding of free-text job descriptions to efficiently and reliably incorporate occupational risk factors into large-scale epidemiologic studies", Occup Environ Med 2016;73(6):417-24. DOI: 10.1136/oemed-2015-103152 [Pubmed Central]
  2. Russ DE, Ho K-Y, Johnson CA, Friesen MC, "Computer-Based Coding of Occupation Codes for Epidemiological Analyses", Proc IEEE Int Symp Comput Based Med Syst 2014, pp. 347-350. DOI: 10.1109/CBMS.2014.79 [Pubmed Central]
  3. Russ DE, Josse P, Remen T, Hofmann JN, Purdue MP, Siemiatycki J, Silverman DT, Zhang Y, Lavoué J, Friesen MC, "Evaluation of the updated SOCcer v2 algorithm for coding free-text job descriptions in three epidemiologic studies", Annals of Work Exposures and Health, 2023, Epub 18 April 2023. DOI: 10.1093/annweh/wxad020
We recommend using SOCcerNET, which has improved agreement with expert coders. SOCcer 2.0 remains enabled for those who used it prior to the update.

Input Format

SOCcer 2.0 requires data in comma-separated value (CSV) format. The data format requires four variables (id, job title, SIC, and job task), but only the id and job title are necessary to obtain a SOCcer score. Job tasks help improve the performance of the classifier. SIC is no longer part of the classifier because the improvement to the results was small; however, at this time the data format still requires a valid SIC code as a placeholder (e.g., 9999 if no SIC). If task information is missing, keep the comma and leave the value blank.

An example of a valid file is:

id,jobtitle,sic,jobtask
myid1,"Teacher, high school",8211, "formulate lesson plans, teach 11th grade match"
myid2,, 8211,"formulate lesson plans, teach 11th grade match"
myid3,"Teacher, high school", 8211,
myid4,"high school teacher",9999,

If the ids are left blank, they will be assigned by SOCcer as 1, 2, 3, … , <number of lines>.

If you are using Microsoft Excel, be careful about having newline characters and smart quotes in your data. They are going to cause problems with the data upload procedure.

If you have any questions, comments, or concerns, send us an e-mail at NCISOCcerWebAdmin@mail.nih.gov.


Input Format

SOCcerNET requires data in either comma-separated value (CSV) format or Excel format (xlsx). The input must contain a column named JobTitle that contains the job title and a column named JobTask that contains the job task. The column order is not important. Optionally, you can also provide columns named soc1980, noc2011, or isco1988 that contain expert-assigned prior occupation codes. Other columns are treated as metadata and ignored, but they are copied to the output. An id column is recommended. If you do not provide one, an id of the form row-XXXXX, where XXXXX is the row number, will be created. The column names are case sensitive. An example of a valid csv file is:

Id,JobTitle,JobTask
studyA-01,Software Engineer,Develop and maintain software applications
studyA-02,Marketing Specialist,Develop and execute marketing campaigns
studyA-03,Human Resources Generalist,Administer employee benefits and payroll
studyA-04,Financial Analyst,Analyze financial data and prepare reports
studyA-05,Project Manager,"Define project scope, goals, and deliverables"
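
Because the SOCcerNET column names are case sensitive, it can help to check the header of a file before submitting it. The sketch below sorts the header into missing required columns, recognized prior-code columns, and everything else. The function name and the returned dictionary are hypothetical. The id column is listed with the metadata here, since this page does not state which spelling of the id column name is recognized.

```python
import csv
import io

REQUIRED = ("JobTitle", "JobTask")
PRIOR_CODE_COLUMNS = ("soc1980", "noc2011", "isco1988")


def describe_columns(csv_text):
    """Classify the header columns of a SOCcerNET-style CSV file."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return {
        # Exact, case-sensitive comparison: 'jobtitle' does not count.
        "missing": [name for name in REQUIRED if name not in header],
        "prior_codes": [name for name in header if name in PRIOR_CODE_COLUMNS],
        "metadata": [
            name
            for name in header
            if name not in REQUIRED and name not in PRIOR_CODE_COLUMNS
        ],
    }


print(describe_columns("Id,JobTitle,JobTask,soc1980\n"))
print(describe_columns("id,jobtitle,jobtask\n"))
```

The second call reports both required columns as missing because the lowercase names do not match JobTitle and JobTask.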

For CLIPS, the input format is the same, but the column containing the products made or services provided must be named products_services instead of JobTitle and the JobTask column is not used.

Id,products_services,sic1987
studyB-001,"Developing and publishing video games for console and PC platforms, including ongoing service updates.",7372
studyB-002,"Full-service dental care, including routine checkups, cleanings, and cosmetic procedures.",8021
studyB-003,Manufacturing commercial-grade pre-packaged frozen pizzas and other frozen entrées.,2038
studyB-004,"Local passenger transit via bus routes, including fixed-schedule and express services.",4111
studyB-005,"Wholesale distribution of industrial chemicals, solvents, and raw plastic materials to manufacturing clients.",5169
studyB-006,"Residential and commercial building construction, specializing in mid-rise office buildings.",1542
studyB-007,Providing short-term consumer loans and secured lending options.,6141
studyB-008,"Operating a full-service, sit-down restaurant offering American cuisine for lunch and dinner.",5812
studyB-009,Manufacturing and assembling complex electrical wiring harnesses and cable sets for the automotive industry.,3629
studyB-010,"Private investigative services, including surveillance, background checks, and corporate fraud investigation.",7381
By default, SOCcerNET and CLIPS will download the results as an Excel file. You can choose to download the results as a csv file, but be careful not to open the csv file in Excel, which will read some of the SOC codes as dates.
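If you do choose the csv download, reading it with a script avoids the date conversion described above. The sketch below uses Python's standard csv module, which keeps every value as text. The column names in the sample are hypothetical, since the layout of the SOCcerNET and CLIPS result files is not described here.

```python
import csv
import io


def read_results(csv_text):
    """Parse result rows; csv.DictReader keeps every value as a string."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hypothetical result layout; the real column names may differ.
sample = (
    "Id,soc2010_1,score_1\n"
    "studyA-01,15-1131,0.959\n"
    "studyA-02,11-2031,0.034\n"
)
rows = read_results(sample)
print(rows[1]["soc2010_1"])      # 11-2031
print(float(rows[0]["score_1"])) # 0.959
```

Scores can be converted to numbers explicitly where needed, while the codes stay untouched.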

If you have any questions, comments, or concerns, send us an e-mail at NCISOCcerWebAdmin@mail.nih.gov.

Frequently Asked Questions


Which jobs require expert review?
SOCcer does not replace manual expert review for many jobs. The proportion requiring review is largely dependent on the quality of the job description data in your study. It also depends on the level of detail needed. Accuracy is higher for 3-digit SOC-2010 codes than for 6-digit SOC codes. For an accurate 6-digit score, we recommend that jobs with scores less than 0.3 be reviewed by an expert. We also recommend reviewing jobs whose two highest scoring SOC codes have similar scores (within 0.1 score units). Our companion software SOCAssign can be used to assist with the manual review.
What is the “gold standard” for assignment?
Best practice is to use two experts to code each job and resolve discrepancies in their coding. However, there is no gold standard. This is because the job descriptions often do not provide sufficient information to pick a single code; multiple, equally plausible codes may exist for a given job description.
Which version of SOCcer do I use?
For all new projects, you will have more accurate results with SOCcerNET. Previous versions remain available for comparison purposes and for users who wish to recreate previous assessments that used v1.0.
What can I do to improve the results of SOCcer?
To improve SOCcer's performance, we are building a database of job descriptions linked to SOC-2010 (and other classification systems) that can be used to build and refine classifiers. If you have job descriptions that have been coded by an expert coder (or initially coded by SOCcer, then reviewed by an expert) and you are willing to provide them to us, we will be happy to include those job descriptions in our knowledge base for use in future versions of SOCcer. If the data are protected, a data use agreement may be possible. Your institute may provide guidance on data use agreements.
What about HIPAA concerns?

SOCcerNET and CLIPS: The data files never leave the user's browser and never reside on any NIH/NCI server, even temporarily.

SOCcer v1 and v2: The data input file does not accept identifiers in order to help prevent you from uploading PII; however, we do not screen your data for PII. Please check your input file for PII before you upload your data onto our server.

What are SOCcer/CLIPS scores?
The SOCcer/CLIPS scores are an estimate of the probability that the suggested code would be selected by an expert coder. In general, the higher the score, the greater the probability of matching an expert. Please see Figures 1 and 2, and our publications, for more details on the relationship between these scores and the probability of matching an expert coder's assignment. Some data sets are more difficult to classify than others, depending on the quality of the data. The score distribution provides an overview of how well SOCcer performed on your data set.