Best development practices
- Introduction
- Software life cycle
- Software management plan
- Software forges
- Documentation
- Some examples of physics and chemistry tools
Introduction
Implementing good development practices, whatever the size of the code, facilitates the work of a scientist throughout the software’s lifecycle and helps ensure its longevity. It becomes essential when code is developed collaboratively. These best practices rely in particular on tools that help to:
- manage different versions, upgrades and collaborative development through the use of version management systems (e.g. Git) and software forges (e.g. GitLab)
- create user and developer documentation using documentation generation tools
- follow construction/compilation/packaging standards (depending on the programming language) to facilitate portability and deployment
- use the software directly through notebook services.
To go further, the Linux Foundation’s Open Source Security Foundation (OpenSSF) provides a list of basic best-practice criteria (the passing badge) for free/libre and open source software (FLOSS) projects.
Software life cycle
The software life cycle is very similar to the data life cycle, with similar steps:
- planning, which can involve producing a software management plan
- publication and opening of codes to the community (see corresponding page on Datacc)
The image below shows all the steps in the software life cycle.
source: Violaine Louvet, CNRS
Scientific questions guide development, which is fully integrated into the research process. Cycles and sub-cycles are iterative and interlocking. Depending on the context, certain steps may not exist or may only be sketched out (e.g. testing and continuous integration). The timeframe is highly variable: the cycle may be interrupted for a short or long period (dormant or even dead code) and then resume if scientific interest is renewed. To discover more, see the Research Software Lifecycle published by EOSC in 2023 (DOI: 10.5281/zenodo.8324828).
Software management plan
To anticipate issues concerning the development and use of code and software, a software management plan can be produced. This document prompts you to think through all the steps in the software life cycle beforehand. It gives recommendations on best practice for developing and distributing code or software. For research projects that involve writing code, linking the software management plan to the corresponding data management plan is strongly recommended. When creating a data management plan, it is possible to declare code or software as a research product. For this research product, the researcher answers specific questions about the production of code. In particular, the last question in the software management plan provides a link with the data management plan. The software management plan is scheduled to be integrated into the Opidor DMP platform. To find out more, read this post.
Software forges
Documentation
Documenting a code is essential:
- for its authors to ensure that developments over time are properly understood
- for potential contributors
- for its users: this may involve several documents, depending on the level of user.
In accordance with good development practice, the following documents should be associated with the code files:
- README: the entry point for the code. This file is automatically displayed when software forges are used. It provides an overview of all the relevant information relating to the software
- Optionally, the CONTRIBUTING (explaining how to contribute to the code), CHANGELOG (describing major changes) and CITATION (explaining how to cite the code in a publication) files
- LICENSE: the software licence.
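The presence of these files can be checked automatically. The following is a minimal sketch, assuming the files sit at the root of the project and may carry an extension such as `.md`, `.txt` or `.cff`. The function name `missing_doc_files` and the split between required and optional files are illustrative assumptions, not a standard.

```python
from pathlib import Path

# File names taken from the list above; the required/optional split is an assumption.
REQUIRED = ("README", "LICENSE")
OPTIONAL = ("CONTRIBUTING", "CHANGELOG", "CITATION")


def missing_doc_files(project_root, names=REQUIRED + OPTIONAL):
    """Return the names from `names` that have no matching file in project_root.

    A file matches when its stem (name without extension) equals the expected
    name, ignoring case, so README, README.md and readme.rst all count.
    """
    root = Path(project_root)
    present = {p.stem.upper() for p in root.iterdir() if p.is_file()}
    return [name for name in names if name not in present]


if __name__ == "__main__":
    # Check the current directory and report what is missing.
    print(missing_doc_files("."))
```

Note that this sketch matches names exactly, so a file spelled LICENCE would be reported as a missing LICENSE; extend `names` if the project uses a different spelling.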
It should be noted that GitLab makes it very easy to integrate these files into a project, using pre-filled templates. In addition to these generic files, documentation generation tools can be used to access technical documentation elements from the code itself. Here are a few examples of popular tools that automatically create complete documentation:
It should also be noted that, when using a GitLab forge, the GitLab Pages mechanism can be used to publish the code documentation as web pages. Some information on this subject is available here. GitLab forges also include a wiki feature that can be used to organise code documentation. Some information on this subject is available here.
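To illustrate what documentation elements taken from the code itself look like in Python, the sketch below attaches a docstring to a function and reads it back with the standard library's `inspect` module. This is the kind of information documentation generators typically collect. The functions `kinetic_energy` and `summarize` are hypothetical examples, not taken from any of the tools mentioned on this page.

```python
import inspect


def kinetic_energy(mass, velocity):
    """Return the kinetic energy of a point mass, in joules.

    Parameters
    ----------
    mass : float
        Mass in kilograms.
    velocity : float
        Speed in metres per second.
    """
    return 0.5 * mass * velocity**2


def summarize(obj):
    """Return the signature and first docstring line of a callable."""
    doc = inspect.getdoc(obj) or ""
    first_line = doc.splitlines()[0] if doc else "(undocumented)"
    return f"{obj.__name__}{inspect.signature(obj)}: {first_line}"


if __name__ == "__main__":
    print(summarize(kinetic_energy))
```

Keeping the description next to the code in this way means the documentation can be regenerated whenever the code changes, rather than maintained as a separate document.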
Some examples of physics and chemistry tools
Physics
Astrophysics Source Code Library (ASCL) is a free online registry and repository for source codes of interest to astronomers and astrophysicists, including solar system astronomers, and lists codes that have been used in research that has appeared in, or been submitted to, peer-reviewed publications. The ASCL is indexed by the SAO/NASA Astrophysics Data System (ADS) and Web of Science and is citable by using the unique ascl ID assigned to each code.
Developed in 2019, published in 2021 and awarded the Open Source Software Prize in 2023 as part of the national Open Science Plan, Fink provides services for astrophysics researchers working on variable and transient phenomena. It is supported by a community of astrophysicists.
Gammapy is an open-source Python package for gamma-ray astronomy built on NumPy, SciPy and Astropy. It is used as the core library for the Science Analysis tools of the Cherenkov Telescope Array Observatory (CTAO), is recommended by the H.E.S.S. collaboration for science publications, and is already widely used to analyse data from existing gamma-ray instruments such as MAGIC, VERITAS and HAWC.
Another tool awarded an Open Source Software Prize in 2023 and developed since 2014, Smilei is a simulation tool for hot plasma physics on supercomputers with numerous physics applications.
WebObs, a real-time observation tool for natural phenomena used by volcanological and seismological observatories, was created in 2001 and won an Open Science prize in 2022.
For further information, a Wikipedia page devoted to physics software provides access to around a hundred links to codes and software in this field.
Chemistry
Chemistry Development Kit (CDK) is a toolkit developed in Java and released under the GNU Lesser General Public License (LGPL). It is designed to perform computational chemistry and biochemistry operations: reading and writing data formats used in chemistry, algorithms for molecular graphs, QSAR (Quantitative Structure Activity Relationship) descriptors, working on chemical structures, etc. An article published in the Journal of Cheminformatics describes the main features of version 2.0. Resources (book, wiki) are available on GitHub.
Open Babel is an open source project aimed at converting data between different formats (over a hundred formats are supported). It provides a toolbox for converting and analysing molecular modelling data in organic and inorganic chemistry and biochemistry. It can filter files using the SMARTS molecular pattern description language. The project’s website and the GitHub repository provide access to the code and documentation. Pybel makes it possible to use Open Babel features in Python scripts.
Created in 2006, RDKit is an open source toolkit developed in C++. It features a number of functions used in cheminformatics (substructure search, 2D and 3D molecular operations, generation of descriptors for machine learning, etc.). It is distributed under a BSD license. The source code is available on GitHub, and a blog presents the latest developments and tips.
ChemPy is an open source Python package used in physical, organic, inorganic or analytical chemistry for applications such as chemical kinetics, multiphase equilibrium calculations, etc. The code is released under a BSD licence. For more information: the GitHub repository and the project blog.
Some tools have more specific features (spectra analysis, optical recognition of chemical structure, text mining, etc.). SpectraFit is an open source package developed in Python and released under the BSD license. It can be used on the command line or in a Jupyter environment to fit spectra in spectroscopy, for example. This article describes the features and workflow of the tool.
There are many optical chemical structure recognition (OCSR) tools available. This article, published in Journal of Cheminformatics, provides a literature review and a benchmark of available tools.
Other tools use packages developed for machine learning to create applications for the field of chemistry. Here are a few examples:
ChemML is a toolkit, developed in Python, designed for analysis, data mining and modeling for chemistry and materials science. It relies on generic machine learning libraries (available via the Anaconda distribution) and chemistry-specific libraries. It is distributed under the BSD license. An article published in WIREs Computational Molecular Science presents the package and its applications.
OpenChem is a package released under the MIT license. It is based on the PyTorch deep learning library and was developed for applications in computational chemistry and drug design. An article published in the Journal of Chemical Information and Modeling describes the models used and offers case studies. See also the GitHub repository for more information.