Code of conduct

Lawfulness

1a. Lawful knowledge and adherence

The Data Scientist will always act in accordance with the law, developing a full knowledge of, and ensuring compliance with, all relevant regulatory regimes. Employers should take steps to raise their data scientists’ awareness and knowledge of such issues.

1b. Privacy and confidentiality protection

The Data Scientist has a duty to protect the privacy and confidentiality of data, to respect the ownership of proprietary data, and not to expose data (within private or public fora) that might cause harm to individuals or legal entities.

1c. Equality legislation

The Data Scientist has a duty not to break equality legislation covering gender, race, ethnicity, marital status, religion, belief, disability, or age. In particular, such attributes should not place individuals at any disadvantage within models or any automated decisions.

Competence

2a. Duty of Competence Development

The Data Scientist will always strive to improve his/her competence and technical excellence. For example, Data Scientists should be encouraged to attend topical presentations, seminars and courses covering new advances.

Dealing with Data

3a. Activity documentation: the personal lab book

The Data Scientist will always keep a personal, auditable, time-based record of his/her work in the form of a “lab book” equivalent, incorporating all data addressed/analyzed and all analytical activities. This should include statements of the source and provenance of all data accessed and analyzed, the methods actually employed, all discoveries and other know-how generated, any limitations of scope and findings, and suggested potential further investigations or applications. Such a lab book is the property of the Data Scientist’s employer.

3b. The Modelling Project Issues Log-Book

Models, applications, and projects should have an internal log-book summarizing all relevant modelling issues arising during the course of their development. The Data Scientist will document any and all known data issues that might qualify the results obtained, the performance of models/algorithms, and possible future applications of the output models/algorithms (for example, known selection bias, the scale of data, the repurposing of existing data, and so on) in a format that can easily be made available to colleagues, managers and decision makers. Such a log-book is the property of the Data Scientist’s employer.

Log books should include the following issues where arising:

3b(ii). Protocol and documentation

The Data Scientist shall document, according to a standard template, each and every step along the data science value chain. This shall include the elicitation of all data sources, the usage and justification of all relevant data sources, the procedures used to combine data sources, and all the steps in the data transformation pipeline. It will also include the model selection, any procedures used to tune the hyper-parameters, the procedure employed to test the model together with its results, and finally the strategy to industrialize the model.

3b(iii). Data adequacy evaluation

The Data Scientist is responsible for assessing the adequacy of data to solve the particular problem and for sharing the results of the analysis, indicating any risks or potential implications due to lack of data quality or availability.

3b(iv). Artificial data handling

The Data Scientist is responsible for communicating all the procedures employed to make the original data more adequate for the specific problem, especially techniques intended to correct gaps in the data or to balance classification problems, e.g. interpolation, extrapolation, oversampling and under-sampling. As far as possible, these procedures should be peer-reviewed.
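One way to make such procedures communicable is to have the balancing step emit its own record. The following is a minimal sketch, assuming simple random oversampling of in-memory rows; the function name `random_oversample`, the `label_of` accessor and the record fields are illustrative choices, not something this code of conduct prescribes.

```python
import random
from collections import Counter


def random_oversample(rows, label_of, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size.

    Returns the balanced rows plus a record describing what was done, so the
    procedure can be shared and peer-reviewed. (Hypothetical helper.)
    """
    rng = random.Random(seed)

    # Group the original rows by class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    # Every class is topped up to the size of the largest class.
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)

    record = {
        "technique": "random oversampling",
        "seed": seed,
        "counts_before": {label: len(group) for label, group in by_class.items()},
        "counts_after": dict(Counter(label_of(row) for row in balanced)),
    }
    return balanced, record
```

The returned record can be pasted into the project log-book next to the results that depend on the synthetic rows.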

3b(v). Responsible data selection

The Data Scientist shall never cherry-pick data or a model to back a particular statement, insight or outcome. Moreover, the Data Scientist shall always analyze the input data to assess it for any indicators of previous bias of this nature.

3b(vi). Inherent data bias

The Data Scientist is supposed to analyze and document potential bias present in the data and assess how this bias might affect the results and the usage of the models.

3b(vii). Surrogate features and bias

The Data Scientist is responsible for detecting and flagging features that might act as surrogates for other features whose use would violate fundamental equality rights (gender, race, religion, etc.). In general, proxy features always need to be checked against socially discriminating features (see also 1c. Equality legislation).
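A first-pass check of this kind can be sketched as a correlation screen. The example below assumes the protected attribute has been numerically encoded and uses Pearson correlation with an arbitrary threshold of 0.8; both are assumptions made for illustration, and a linear screen will miss non-linear or combined proxies. The names `pearson` and `flag_proxies` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (spread_x * spread_y)


def flag_proxies(features, protected, threshold=0.8):
    """Return the features whose |correlation| with the protected attribute
    reaches the threshold, mapped to the correlation value."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, protected)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged
```

Flagged features are candidates for review and documentation, not automatic removal; that decision depends on the use of the model.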

3c. Original data preservation

The Data Scientist shall retain unaltered copies of the original data while keeping a record describing the set of transformations made across the whole data value chain (including ingestion, cleansing, feature extraction, scaling/normalization, feature selection, etc.).
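As a minimal sketch of this clause, the original bytes can be fingerprinted once and every later transformation appended to an ordered record. The class name `TransformationLog` and its fields are assumptions made for this example; a real project would more likely rely on its data versioning tooling.

```python
import hashlib


def fingerprint(raw_bytes):
    """SHA-256 digest of the original data, used to detect any alteration."""
    return hashlib.sha256(raw_bytes).hexdigest()


class TransformationLog:
    """Ordered record of transformations applied to one original data set."""

    def __init__(self, raw_bytes):
        self.original_sha256 = fingerprint(raw_bytes)
        self.steps = []

    def record(self, stage, description):
        # Stage names follow the value chain: ingestion, cleansing, and so on.
        self.steps.append(
            {"order": len(self.steps) + 1, "stage": stage, "description": description}
        )

    def original_is_unaltered(self, raw_bytes):
        """True if the retained copy still matches the stored fingerprint."""
        return fingerprint(raw_bytes) == self.original_sha256
```

Transformations are then applied to working copies only, and the retained copy can be re-checked against the fingerprint at any time.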

3d. Collection vs use of data

The Data Scientist needs to understand the trade-off between gathering all potential data and focusing on just the data that is likely to be used to solve a particular problem. The Data Scientist’s data gathering requests are expected to be appropriate to the problem being addressed, neither exaggerated nor lacking. In any case, the Data Scientist should document the reason that a particular data set needs to be gathered.

3e. De-Identification

The Data Scientist shall not apply any technique (combination, enrichment, etc.) to turn information that has been deliberately “de-identified” into “identifiable” information again.
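Re-identification can also happen unintentionally when data sets are combined or enriched. A rough guard, sketched below, is to look at the smallest group of records sharing the same quasi-identifier values (the idea behind k-anonymity): a group of size 1 means a record is unique on those columns. The function name `smallest_group` and the example column names are assumptions, and a large group size alone does not prove that data is safe.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest set of records sharing the same values for the
    given quasi-identifier columns. A result of 1 means at least one record
    is unique on those columns."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers) for record in records
    )
    return min(groups.values())
```

Comparing the result before and after adding a column shows whether the enrichment makes individual records stand out.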

3f. Probabilistic (inferred) information and GDPR

The Data Scientist is often able to generate more or less accurately inferred information about a person (e.g. derived from statistical similarities with other people). The Data Scientist will treat this information in the same way as personal data subject to GDPR, ascribing to the newly inferred information a score indicating how reliable it is. Moreover, algorithmically inferred information about a person shall be given the same treatment as factual information (as expected for compliance with GDPR).
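In data-model terms, this could mean storing inferred and factual attributes in the same structure, with a flag and a reliability score, so that any handling rule applies to both. The sketch below is illustrative only: the `PersonalAttribute` fields, the 0 to 1 score range and the `erase_subject` helper are assumptions, and nothing here is a statement of what GDPR requires in a given case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonalAttribute:
    subject_id: str
    name: str
    value: str
    inferred: bool      # True when produced by a model rather than supplied
    reliability: float  # assumed scale: 0.0 (unreliable) to 1.0 (certain)

    def __post_init__(self):
        if not 0.0 <= self.reliability <= 1.0:
            raise ValueError("reliability must be between 0 and 1")


def erase_subject(attributes, subject_id):
    """Remove every attribute of a subject, factual and inferred alike."""
    return [a for a in attributes if a.subject_id != subject_id]
```

Because erasure filters on the subject only, an inferred income band is removed together with the postcode the person actually supplied.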

Algorithms and models

4a. Exhaustive algorithms: Data dredging, Data fishing, Data snooping, p-hacking

The Data Scientist is responsible for separating correlations that are the result of chance or of deliberate data-mining-driven searches from well-established, hypothesis-driven correlations. Where exhaustive methods have been used to locate anomalies etc., these results should be clearly declared as such and, without further statistical tests, should not be represented as a consequence of specific hypothesis-driven analyses.
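One common form of "further statistical test" after an exhaustive search is a multiple-comparison correction. The sketch below applies the Bonferroni correction, chosen here only because it is the simplest; it is conservative, and other corrections may suit a given project better. The function name and the example p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Mark which of m tested hypotheses remain significant after dividing
    the significance level by the number of tests performed."""
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# Four correlations found by scanning every feature against the target.
scan = {"feature_a": 0.04, "feature_b": 0.001, "feature_c": 0.20, "feature_d": 0.03}
survivors = bonferroni_significant(scan)
```

With four tests the per-test threshold becomes 0.0125, so only `feature_b` survives, although three of the four raw p-values are below 0.05.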

4b. Sampling bias

The Data Scientist shall sample the data in such a way that the sample is as representative as possible of the population under analysis. Insights coming from the data shall be inspected for sampling bias before being made available for any decision.
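Where the composition of the population is known, a simple inspection is to compare group shares in the sample against it. The following sketch assumes the population shares are available from an external source; the function name `representation_gaps` and the example groups are illustrative.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """For each group, the share observed in the sample minus the share in
    the population. Positive values mean the group is over-represented."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }
```

For a sample with 80 urban and 20 rural respondents and population shares of 0.6 and 0.4, the gaps are roughly +0.2 and -0.2, which would need to be documented or corrected before insights are released.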

4c. Survivorship bias

The Data Scientist is responsible for questioning the data before creating any model and for understanding why a particular data set has passed certain filtering criteria, without overlooking those data items that did not.

4d. Discarding unfavorable data

The Data Scientist is accountable for the consequences of discarding data that is not showing the desired outcome for the company he/she works for.

4e. Causality and correlation

The Data Scientist is responsible for clearly separating causality from correlation and for explaining the consequences of wrongly establishing a causal relationship between two variables that are merely correlated.

4f. Crisp geolocation analysis and Gerrymandering

The Data Scientist shall be aware of the impact of changing geographical aggregation units. A particular case is so-called “gerrymandering”, which consists of selecting different geographical units to influence the results of elections.
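The effect of the aggregation unit can be shown with a toy example. In the sketch below, nine voters in a row cast the same votes under two different sets of district boundaries; the data, the boundaries and the function name `seats_won` are invented for illustration.

```python
from collections import Counter


def seats_won(votes, districts):
    """Count how many districts each party wins by simple plurality.
    `districts` is a list of index collections into `votes`."""
    seats = Counter()
    for district in districts:
        tally = Counter(votes[i] for i in district)
        winner, _ = tally.most_common(1)[0]
        seats[winner] += 1
    return dict(seats)


votes = list("AAABABBBB")  # party A has 4 votes, party B has 5

equal_thirds = [range(0, 3), range(3, 6), range(6, 9)]
redrawn = [range(0, 2), range(2, 5), range(5, 9)]

print(seats_won(votes, equal_thirds))  # B wins two of three districts
print(seats_won(votes, redrawn))       # A wins two of three districts
```

The individual votes are identical in both runs; only the boundaries change, and with them the aggregate outcome.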

4g. Big picture beyond accuracy metrics

The Data Scientist is expected to understand the big picture beyond metrics, which includes the business context, the way the model is going to be used, etc. Providing the MAE (mean absolute error) or the AUC (area under the ROC curve) value is not enough, yet many data scientists think their job ends there.

4h. Data Science and Publication bias

When the Data Scientist presents research evidence to substantiate any particular insights, this evidence is expected to be checked for publication bias.

4i. Accuracy vs. Explainability trade-off

The Data Scientist needs to make the right call, depending on the particular problem, between accuracy and explainability. There are situations where explainability should prevail over accuracy. Conversely, there are times when explainability is not a must-have. A professional decision is expected, based on the predicted use of the model.

4k. Pre-trained models re-usability

More and more data scientists consider using a third-party pre-trained model, e.g. pre-trained word embeddings (such as word2vec, GloVe or fastText) or pre-trained object recognition/image classification CNNs (Oxford VGG16, YOLO, etc.). The Data Scientist is responsible for auditing such a model against all the clauses of this code of conduct.

4l. AI Reproducibility

Most of the models created by data scientists have stochastic components, meaning there is no guarantee that the same model will be produced given the same training data. Moreover, it is a known issue that fixing a seed to force reproducibility can compromise the parallelization of the models.
The Data Scientist shall be responsible for ensuring reproducibility in situations where understanding the overall behavior of the system is critical.
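In the simplest single-threaded case, reproducibility comes from routing every random choice through one explicitly seeded generator. The sketch below is a toy stand-in for training (a one-weight model with random initialization and shuffling); the function name and data are hypothetical, and it does not address the parallel case mentioned above, where a seed alone may not be sufficient.

```python
import random


def train_toy_model(data, seed):
    """Fit a single weight with a few stochastic steps. All randomness
    (shuffling and initialization) comes from one seeded generator."""
    rng = random.Random(seed)   # local generator, no global state touched
    shuffled = data[:]
    rng.shuffle(shuffled)
    weight = rng.gauss(0.0, 1.0)
    for x, y in shuffled:
        weight += 0.1 * (y - weight * x) * x
    return weight


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
first = train_toy_model(data, seed=42)
second = train_toy_model(data, seed=42)
assert first == second  # same data and same seed give the same model
```

Using a local `random.Random` instance rather than the module-level functions keeps other code from silently advancing the generator between runs.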

4m. Cold-Start Bias

A common source of bias is the cold start phase, where no data is available but the system needs to function according to a set of predefined data. The Data Scientist is responsible for pinpointing the potential limitations of any intelligent system in a ramp-up phase and how the existence of abundant data will change the output of the system.

Transparency, Objectivity and Truth

5a. Transparency as a duty

The Data Scientist will strive for transparency within as wide a forum as allowable by legal and proprietary constraints. The Data Scientist will not withhold concerns or potential limitations from colleagues and managers.

5b. Provable objective results

The Data Scientist will make only objective assessments within any lay summaries of results and performance and recommendations of technical methods.

5c. Clear results communication

The Data Scientist will not overclaim nor present any misleading statements regarding the performance and efficacy in a summary or when stating objective facts.

5d. Transparency on quality of the results

The Data Scientist shall provide a standardized framework to demonstrate how good the resulting model is, applying industry-wide best practices (training, test and validation data sets, etc.) and keeping the training, test and validation data sets as proof. If required, the seed used to train the system shall also be kept to allow for reproducibility.
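A minimal sketch of keeping the split as proof is to store the seed together with the row indices assigned to each set. The 70/15/15 proportions and the function name `split_indices` are assumptions made for this example, not recommended values.

```python
import random


def split_indices(n_rows, seed, train=0.7, validation=0.15):
    """Shuffle row indices with a recorded seed and cut them into training,
    validation and test sets. The returned dict can be archived as-is."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train)
    n_validation = round(n_rows * validation)
    return {
        "seed": seed,
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_validation],
        "test": indices[n_train + n_validation:],
    }
```

Archiving the returned dictionary alongside the model lets a reviewer rebuild exactly the same three sets from the retained original data.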

5e. Expectations alignment

The Data Scientist has a professional duty to correct any misunderstandings or unfounded expectations of colleagues, managers or decision makers who may rely on his/her work.

5h. Creation of manipulative evidence

The Data Scientist shall not make use of any technique to create or assist in the creation of manipulative evidence (e.g. psychometrics, social network analysis, etc.).

Working alone and with others

6a. Collegiality

The Data Scientist will always act in a collegiate manner with colleagues. This includes disclosing any facts, assessments, or insights that may be relevant to colleagues’ own data science work.

6b. Duty to speak up

The Data Scientist has a professional duty to raise concerns over any potential breaches to this code by him/herself or by others to relevant authorities and management, usually the line manager of any person possibly involved in a breach.

(extra) Upcoming ethical challenges

7a. Adversarial Learning Manipulation

Data scientists shall not purposely employ techniques such as targeted and non-targeted adversarial attacks to manipulate the result of existing models. Moreover, it is expected that Data Scientists perform adversarial training.

7b. Responsibility on inventions

The Data Scientist is expected to make a professional judgement about the usage of their inventions and to gauge the benefit vs. the risk. In any case, inventions with the potential to cause harm shall be protected and secured so that only beneficial usages are possible.

7c. Explainable AI: a research field and a duty

The Data Scientist shall be able to explain how their algorithms work and how they come up with their predictions/outputs (this is especially challenging in the deep learning area).

7d. Blockchain and personal data

The Data Scientist shall be aware of the implications of new decentralized data storage technologies where critical privacy-protecting operations (such as physical record deletion) are not directly supported.