Code of conduct
Lawfulness
1a. Lawfulness
The Data Scientist will always act in accordance with the law, developing a full knowledge of, and ensuring compliance with, all relevant regulatory regimes. Employers should take steps to raise their data scientists’ awareness and knowledge of such issues.
1b. Privacy and confidentiality protection
1c. Equality legislation
Competence
2a. Duty of Competence Development
2b. Competence communication transparency
Dealing with Data
3a. Activity documentation: the personal lab book
The Data Scientist will always keep a personal, auditable, time-based record of his/her work in the form of a “lab book” equivalent, incorporating all data addressed/analyzed and all of his/her analytical activities. This should include statements of the source and provenance of all data accessed and analyzed, the methods actually employed, all discoveries and other know-how generated, any limitations of scope and findings, and suggested potential further investigations or applications. Such a lab-book is the property of the Data Scientist’s employer.
3b. The Modelling Project Issues Log-Book
Models, applications, and projects should have an internal log-book summarizing all relevant modelling issues arising during the course of their development. The Data Scientist will document any and all known data issues that might qualify the results obtained, the performance of models/algorithms, and possible future applications of the output models/algorithms (for example, known selection bias, the scale of data, the repurposing of existing data, and so on) in a format that can be easily made available to colleagues, managers and decision makers. Such a log-book is the property of the Data Scientist’s employer.
Log-books should include the following issues where they arise:
3b(i). Accuracy importance depending on the nature of the problem
3b(ii). Protocol and documentation
The Data Scientist shall document, according to a standard template, each and every step along the data science value chain. This shall include the identification of all data sources, the usage and justification of all relevant data sources, the procedures used to combine data sources, and all the steps in the data transformation pipeline. It shall also include the model selection, any procedures used to tune the hyper-parameters, the procedure employed to test the model and its results, and finally the strategy to industrialize the model.
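A standard template of this kind can be kept as a simple structured record. The sketch below is one possible shape (the section names are illustrative, not mandated by this code); each documented step is appended to the relevant section:

```python
import json

# One key per section the clause enumerates (illustrative names).
TEMPLATE_SECTIONS = [
    "data_sources",           # elicitation and justification of all sources
    "data_combination",       # procedures used to combine sources
    "transformations",        # every step in the transformation pipeline
    "model_selection",        # candidate models and the chosen one
    "hyperparameter_tuning",  # tuning procedures used
    "testing",                # test procedure and its results
    "industrialization",      # strategy to put the model in production
]

def new_project_doc() -> dict:
    """Return an empty documentation record with every mandatory section."""
    return {section: [] for section in TEMPLATE_SECTIONS}

doc = new_project_doc()
doc["transformations"].append("dropped rows with a missing target label")
print(json.dumps(doc, indent=2))
```

Serializing the record (here as JSON) makes it easy to share with colleagues, managers and decision makers, as clause 3b requires.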
3b(iii). Data adequacy evaluation
3b(iv). Artificial data handling
The Data Scientist is responsible for communicating all the procedures employed to make the original data more adequate for the specific problem, especially techniques intended to correct gaps in the data or to balance classes in classification problems (e.g., interpolation, extrapolation, oversampling and undersampling). As far as possible, these procedures should be peer-reviewed.
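One of the simplest balancing techniques the clause mentions, random oversampling of minority classes, can be sketched in a few lines (a deliberately minimal pure-Python illustration; in practice dedicated libraries are commonly used, and whichever procedure is chosen should be documented and peer-reviewed):

```python
import random

def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes
    reach the size of the largest class."""
    rng = random.Random(seed)          # fixed seed so the procedure is auditable
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extras = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extras:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

rows = [[1], [2], [3], [4], [5]]
labels = ["a", "a", "a", "a", "b"]
_, balanced = oversample_minority(rows, labels)
print(balanced.count("a"), balanced.count("b"))  # 4 4
```

Because the duplicated rows are artificial, the fact that oversampling was applied (and with which seed) belongs in the project log-book.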
3b(v). Responsible data selection
3b(vi). Inherent data bias
3b(vii). Surrogate features and bias
The Data Scientist is responsible for detecting and flagging features that might act as surrogates for other features protected by fundamental equality rights (gender, race, religion, etc.). In general, proxy features always need to be checked against socially discriminating features (see also 1c. Equality legislation).
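A first-pass screen for surrogate features is to correlate each candidate feature against an encoded protected attribute and flag strong associations. The feature names, data and threshold below are purely illustrative, and a correlation screen is only a starting point, not a complete fairness audit:

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def flag_proxy_features(features, protected, threshold=0.8):
    """Flag features whose |correlation| with the protected attribute
    exceeds the threshold, as candidate surrogates to investigate."""
    return [name for name, values in features.items()
            if abs(pearson(values, protected)) > threshold]

# Illustrative data: "postcode_index" tracks the protected attribute exactly.
features = {
    "postcode_index": [1, 1, 2, 2, 3, 3],
    "tenure_years":   [5, 1, 4, 2, 6, 3],
}
protected = [1, 1, 2, 2, 3, 3]  # encoded protected attribute
print(flag_proxy_features(features, protected))  # ['postcode_index']
```

Flagged features should be recorded in the project log-book together with the decision taken about them (dropped, transformed, or retained with justification).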
3c. Original data preservation
3d. Collection vs use of data
The Data Scientist needs to understand the trade-off between gathering all potentially available data and focusing on just the data that is likely to be used to solve a particular problem. It is expected that the Data Scientist’s data-gathering requests are appropriate to the problem being addressed, neither excessive nor insufficient. In any case, a Data Scientist should document the reason that a particular data set needs to be gathered.
3e. De-Identification
3f. Probabilistic (inferred) information and GDPR
The Data Scientist is often able to infer information about a person with greater or lesser accuracy (e.g., gained from statistical similarities with other people), and will treat this information in the same way as personal data subject to the GDPR, ascribing the newly inferred information a score indicating how reliable it is. Moreover, algorithmically inferred information about a person shall be given the same treatment as factual information (as expected under compliance with the GDPR).
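One way to make the required reliability score unavoidable is to never store an inferred value without it. The structure below is a sketch of that idea (field names are illustrative; nothing here is a GDPR-mandated format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InferredAttribute:
    """An algorithmically inferred personal attribute, treated as
    personal data under GDPR, carrying the reliability score this
    clause requires."""
    subject_id: str
    attribute: str
    value: str
    reliability: float  # e.g. a model-estimated probability in [0, 1]

    def __post_init__(self):
        # Refuse to create a record without a valid reliability score.
        if not 0.0 <= self.reliability <= 1.0:
            raise ValueError("reliability must be in [0, 1]")

inferred = InferredAttribute("user-42", "likely_home_region", "north", 0.73)
print(inferred.reliability)  # 0.73
```

Making the record frozen (immutable) also helps downstream code treat the inference and its score as an inseparable pair.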
Algorithms and models
4a. Exhaustive algorithms: Data dredging, Data fishing, Data snooping, p-hacking
The Data Scientist is responsible for separating correlations that are the result of chance or of deliberate data-mining-driven searches from well-established, hypothesis-driven correlated information. Where exhaustive methods have been used to locate anomalies and the like, these results should be clearly declared as such, and not represented as the consequence of specific hypothesis-driven analyses without further statistical tests.
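The reason exhaustive searches need further statistical tests is that testing many hypotheses inflates the number of chance "discoveries". A standard guard is a multiple-comparison correction; the sketch below uses the simple Bonferroni correction on illustrative p-values (other corrections, such as false-discovery-rate methods, are often preferred in practice):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Keep only hypotheses that remain significant after Bonferroni
    correction: with m tests, each p-value must beat alpha / m."""
    m = len(p_values)
    return {name: p for name, p in p_values.items() if p < alpha / m}

# Illustrative: 100 exhaustively tested feature pairs. At alpha = 0.05 each
# "p = 0.03" pair would look significant on its own, but not after
# correcting for the 100 searches actually performed.
p_values = {f"pair_{i}": 0.03 for i in range(98)}
p_values["pair_98"] = 0.0001
p_values["pair_99"] = 0.0004
print(sorted(bonferroni_significant(p_values)))  # ['pair_98', 'pair_99']
```

Declaring both the number of hypotheses searched and the correction applied is what lets readers distinguish dredged correlations from hypothesis-driven ones.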
4b. Sampling bias
4c. Survivorship bias
4d. Discarding unfavorable data
4e. Causality and correlation
4f. Crisp geolocation analysis and Gerrymandering
4g. Big picture beyond accuracy metrics
4h. Data Science and Publication bias
4i. Accuracy vs. Explainability trade-off
The Data Scientist needs to make the right call, depending on the particular problem, between accuracy and explainability. There are situations where explainability should prevail over accuracy; conversely, there are times when explainability is not a must-have. A professional decision is expected, based on the predicted use of the model.
4j. Mandatory documentation of accuracy, precision and fitness for purpose
4k. Pre-trained models re-usability
More and more data scientists consider using third-party pre-trained models (e.g., pre-trained word embeddings such as word2vec, GloVe or fastText, or pre-trained object-recognition/image-classification CNNs such as Oxford VGG16 or YOLO). The Data Scientist is responsible for auditing any such model against all the clauses of this code of conduct.
4l. AI Reproducibility
Most of the models created by data scientists have stochastic components, meaning there is no guarantee that the same model will be produced given the same training data. Moreover, it is a known issue that fixing a seed to force reproducibility can compromise the parallelization of model training.
The Data Scientist shall be responsible for ensuring reproducibility in situations where understanding the overall behavior of the system is critical.
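A minimal way to meet this responsibility is to record and fix every random seed the pipeline uses. The sketch below stands in for a stochastic training step with a seeded bootstrap resample (the standard-library generator is used for illustration; the same discipline applies to whichever ML library is actually in use):

```python
import random

def train_with_seed(data, seed):
    """Stand-in for a stochastic training step: a bootstrap resample.

    Using an isolated, explicitly seeded generator (rather than the
    global one) keeps the run reproducible and free of hidden state.
    """
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(len(data))]

data = [1, 2, 3, 4, 5]
run_a = train_with_seed(data, seed=1234)
run_b = train_with_seed(data, seed=1234)
print(run_a == run_b)  # True: same seed, same result
```

Recording the seed alongside the trained artifact is what allows a colleague, or an auditor, to regenerate the exact same model later.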
4m. Cold-Start Bias
A common source of bias is the cold start phase, where no data is available but the system needs to function according to a set of predefined data. The Data Scientist is responsible for pinpointing the potential limitations of any intelligent system in a ramp-up phase and how the existence of abundant data will change the output of the system.
4n. Prejudices and attempts against fundamental rights
Transparency, Objectivity and Truth
5a. Transparency as a duty
5b. Provable objective results
5c. Clear results communication
5d. Transparency on quality of the results
The Data Scientist shall provide a standardized framework to demonstrate how good the resulting model is, applying industry-wide best practices (train, test and validation data sets, etc.) and keeping the training, test and validation data sets for proof. If required, the seed used to train the system shall also be kept to allow for reproducibility.
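The train/test/validation discipline this clause refers to can be sketched as a seeded split whose seed is retained for proof. The proportions below are illustrative, and in practice a library routine would typically be used; the point is that keeping the seed lets anyone reconstruct the exact data sets behind a reported quality figure:

```python
import random

def split_dataset(rows, seed, train=0.6, test=0.2):
    """Seeded train/test/validation split.

    Retaining the seed (and the input order) allows the exact split to
    be reproduced later as proof of the reported model quality."""
    shuffled = rows[:]                      # leave the original data untouched
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])    # remainder: validation set

rows = list(range(10))
train_set, test_set, val_set = split_dataset(rows, seed=2024)
print(len(train_set), len(test_set), len(val_set))  # 6 2 2
```

Archiving the three resulting data sets (or the seed plus the frozen input) alongside the reported metrics is what makes the quality claim checkable by others.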