Talk:PLOS/Chemical graph generators

Review by Jean-Loup Faulon

http://topicpageswiki.plos.org/w/index.php?title=Talk:Chemical_Graph_Generators&oldid=8257

This Topic Page covers the field of chemical structure generation well. The field is properly reviewed, and the general concept and the main methods are explained simply. Specific comments for each section are given below.

Abstract

In the list of applications of molecular generators, one could add "design of molecules with specified properties and/or activities" (also named inverse QSAR/QSPR). Retrosynthesis should be added to organic synthesis design and linked to https://en.wikipedia.org/wiki/Retrosynthetic_analysis and CASE should be linked to https://en.wikipedia.org/wiki/Computer-assisted_structure_elucidation

History section

The term graph generation should be replaced by graph enumeration and linked to https://en.wikipedia.org/wiki/Graph_enumeration

Before introducing structure generators, it would be nice to add a couple of sentences on past work regarding the graph counting problem in chemistry, at least citing A. Cayley and G. Polya. In these early works, the idea of the symmetry group was developed, and it was later used by chemical structure generators. There are in fact many papers dealing with counting specific chemical families (like benzenoids, hydrocarbon cages, fullerenes, cycloalkanes, stereoisomers). These have been reviewed in doi: 10.1002/0471720895.ch3 but I’ll let the authors decide how far they want to go in talking about the counting problem.

In the sentence "His software, SIGNATURE[9], was integrated into this stochastic generator for canonical labelling and duplicate checks.[10]" the references should be exchanged: first comes 10, then 9. Reference 9 should be doi: 10.1021/ci020346o instead of the current one (doi: 10.1021/ci020345w).

The reference doi: 10.1021/ci950179a should be added at the end of the sentence: "In the generation, this transformation relies on equations based on Faulon’s rules." Regarding the sentence "Compared to MOLGEN, OMG generates large molecules almost 2000 times slower than can be achieved with MOLGEN", was a specific benchmark carried out? Also, the comparison between MOLGEN, OMG and PMG is covered again in the method section; perhaps once is enough.

Mathematical basis section

In Figure 3, the term enumeration is confusing, as the 3 molecules are the same; the term label should be used instead, i.e., replace "Fig 3. Molecular Symmetry. (A) The initial enumeration of 2,4-Dimethyl-2-pentene. (B) and (C) are symmetries of the same molecule with different enumerations." by "Fig 3. Molecular Symmetry. (A) The initial labeling of 2,4-Dimethyl-2-pentene. (B) and (C) are symmetries of the same molecule with different labels."
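
To make the point about labels concrete, here is a minimal sketch, not taken from the Topic Page, that treats the carbon skeleton of 2,4-dimethyl-2-pentene as a bond-labelled graph and finds its symmetries by brute force. The atom numbering, the names `BONDS`, `relabel`, `automorphisms` and `canonical`, and the brute-force approach itself are assumptions made for illustration; real generators use far more efficient canonical labelling.

```python
from itertools import permutations

# Carbon skeleton of 2,4-dimethyl-2-pentene: CH3-C(CH3)=CH-CH(CH3)-CH3.
# The numbering is an arbitrary assumption, not the one used in Fig 3.
# Keys are bonds with sorted endpoints, values are bond orders.
BONDS = {
    (0, 1): 1,  # C1-C2
    (1, 2): 2,  # C2=C3
    (2, 3): 1,  # C3-C4
    (3, 4): 1,  # C4-C5
    (1, 5): 1,  # methyl on C2
    (3, 6): 1,  # methyl on C4
}
N_ATOMS = 7


def relabel(bonds, perm):
    """Rename every atom u to perm[u]; the molecule itself is unchanged."""
    return {tuple(sorted((perm[u], perm[v]))): order
            for (u, v), order in bonds.items()}


def automorphisms(bonds, n):
    """Relabellings that map the bond table onto itself (the symmetries)."""
    return [p for p in permutations(range(n)) if relabel(bonds, p) == bonds]


def canonical(bonds, n):
    """Brute-force canonical form: the smallest bond list over all labellings."""
    return min(sorted(relabel(bonds, p).items()) for p in permutations(range(n)))


symmetries = automorphisms(BONDS, N_ATOMS)
other_labels = relabel(BONDS, (6, 5, 4, 3, 2, 1, 0))  # same molecule, new labels
```

In this sketch the only symmetries are the swaps of the two methyl groups on each branched carbon, and two differently labelled copies of the molecule share one canonical form, which is what a duplicate check relies on.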

Method section

This section could be shortened, as it sometimes reads more like the history section; the authors should focus here on the specific methods, their pros and cons, and their performance. There are also recent methods not covered in the topic page. These are aimed at generating chemical structures based on retrosynthesis (see DOI: 10.1186/s13321-017-0252-9) or make use of deep learning to design molecules (non-exhaustive list of DOIs: 10.1021/acs.molpharmaceut.7b00346; 10.1002/minf.201700123; 10.1021/acscentsci.7b00572; 10.1021/acscentsci.7b00512; 10.1186/s13321-017-0235-x; 10.1126/sciadv.aap7885). These should be mentioned as novel directions the field is moving into.

Conclusion section

"there is a lack of efficient open-source structure generators." One should be careful when using the term efficient, which in computer science means running in polynomial time. This cannot be the case when generating molecules, as there might be an exponential number of solutions. One can only hope that the time taken between two consecutive structures printed by the generator is polynomial (this has been named polynomial delay, https://en.wikipedia.org/wiki/Polynomial_delay). To the best of my knowledge, no one has yet shown that molecules can be generated by a polynomial delay algorithm.
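
As a toy illustration of polynomial delay (deliberately not a molecule generator, for which the question raised above remains open), the sketch below enumerates all bit strings of length n. The total output is exponential in n, but the work between two consecutive outputs is a single binary increment, i.e. O(n). The function name `bit_strings` is an assumption made for this example.

```python
def bit_strings(n):
    """Yield all 2**n bit strings of length n in lexicographic order.

    Each string is derived from the previous one by a binary increment,
    so the delay between two consecutive outputs is O(n) even though
    the number of outputs grows exponentially with n.
    """
    bits = [0] * n
    while True:
        yield "".join(map(str, bits))
        # Binary increment: reset trailing 1s to 0, then set the next 0 to 1.
        i = n - 1
        while i >= 0 and bits[i] == 1:
            bits[i] = 0
            i -= 1
        if i < 0:
            return  # the counter wrapped around: everything has been produced
        bits[i] = 1


first_few = list(bit_strings(2))
```

The total running time is still exponential; only the gap between outputs is bounded, which is the most one can ask of a generator whose solution set may itself be exponential.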

Review by Jean-Marc Nuzillard

http://topicpageswiki.plos.org/w/index.php?title=Chemical_Graph_Generators&oldid=8254

This article is a useful introduction to chemical structure generation, an important aspect of cheminformatics and computational biology, in relation to metabolomics studies.

As a general remark, breaking long text blocks into paragraphs facilitates reading.

The abstract refers to drug design and to organic synthesis but the main text does not. See J. Mol. Struct. (Theochem) 165 (1988), 229-242 (or https://en.wikipedia.org/wiki/List_of_computer-assisted_organic_synthesis_software) for organic synthesis.

The limits of the sections entitled "History", "Structure assembly", and "Structure reduction" are not well-defined. Historical considerations are present everywhere, while the History section contains information about Methods and Structure reduction. Please refer to COCON (academic, J Cheminform 3 (2011), 27) and CMC-se (Bruker TopSpin, commercial software) as available structure generators, even though only little detail is known about their algorithms. Also, StrucEluc is a commercial product under the name Structure Elucidator, available from ACD/Labs. A table stating which software is available from where, for which purpose and under which conditions would be helpful to the reader.

The section about Mathematical basis needs some re-reading.

Instead of referring to multigraphs, would it not be better to refer to vertex- and edge-labeled graphs (https://en.wikipedia.org/wiki/Graph_labeling)?

It should be noted that the vertex-and-edge description of atom assemblies is appropriate for "standard" organic molecules but is not well-suited for the representation of coordination complexes.

The drawing in Fig. 2A is already the representation of a graph, with coloured vertices.

It can be useful for a structure generator to produce a non-connected graph if the solution to a structure elucidation problem is a mixture (Phytochem. Anal. 1992, 3, 153-159).

Graph G is defined either as G(E,V) or G(V,E).

Permutation composition is defined by (ab)(x) = a(b(x)) but this definition is not consistent with the composition operator written *. (a*b)(x) = a(b(x)) should be preferred. Is X a set of permutations or a set of integers?

What is the difference between a(u, v) and a((u, v))?
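
For concreteness, the convention (a*b)(x) = a(b(x)) and one possible reading of a((u, v)), namely a permutation acting on a single edge component-wise, can be sketched as follows. The tuple encoding of permutations and the names `compose` and `act_on_edge` are assumptions made for this illustration, not notation from the Topic Page.

```python
def compose(a, b):
    """Return a*b under the convention (a*b)(x) = a(b(x)): apply b first."""
    return tuple(a[b[x]] for x in range(len(b)))


def act_on_edge(a, edge):
    """One reading of a((u, v)): apply the permutation to both endpoints."""
    u, v = edge
    return (a[u], a[v])


# Permutations of X = {0, 1, 2}, stored as tuples with p[x] the image of x.
a = (1, 2, 0)  # 0 -> 1, 1 -> 2, 2 -> 0
b = (1, 0, 2)  # swaps 0 and 1

ab = compose(a, b)  # b first, then a
ba = compose(b, a)  # a first, then b
image = act_on_edge(a, (0, 2))
```

With this encoding X is a set of integers and the permutations are maps on X; the two products differ, so the order in which the composition is written matters.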

Since atom-edge canonical labelling is mentioned, a reference to InChI (J. Cheminf. 7 (2015), 23) and/or to ALATIS (Sci. Data 4 (2017), 170073) would be useful.

Even though chemical structure generation has a strong mathematical basis, applications to molecules of biological interest (or at least of biological origin) should be referenced.

The conclusion says that "The algorithms are generally breadth-first searches" but LSD is "depth-first" (Chin. J. Chem. 21 (2003), 1263-1267).

The scarcity of recent bibliographic references may leave the reader with the feeling that the topic of structure generation is somewhat outdated.

Jmnuzi (talk) 09:32, 5 January 2020 (PST)

Response to Reviewers

We would like to thank our reviewers for their careful reading and constructive comments. The draft has been modified in line with their suggestions.

Response to Jean-Loup Faulon

Abstract

QSAR/QSPR and retrosynthesis are added to the abstract.

History Section

The term graph generation should be replaced by graph enumeration and linked to https://en.wikipedia.org/wiki/Graph_enumeration

In this article, we focus on structure generation, not enumeration. That is why we thought it better to keep the term graph generation.

Before introducing structure generators, it would be nice to add a couple of sentences on past work regarding the graph counting problem in chemistry, at least citing A. Cayley and G. Polya. In these early works, the idea of the symmetry group was developed, and it was later used by chemical structure generators. There are in fact many papers dealing with counting specific chemical families (like benzenoids, hydrocarbon cages, fullerenes, cycloalkanes, stereoisomers). These have been reviewed in doi: 10.1002/0471720895.ch3 but I’ll let the authors decide how far they want to go in talking about the counting problem.

Thanks for the references and advice. We prefer to cover isomer enumeration, such as Polya's work, in our chemical graph theory review article rather than in this article.

In the sentence "His software, SIGNATURE[9], was integrated into this stochastic generator for canonical labelling and duplicate checks.[10]" the references should be exchanged: first comes 10, then 9. Reference 9 should be doi: 10.1021/ci020346o instead of the current one (doi: 10.1021/ci020345w). The reference doi: 10.1021/ci950179a should be added at the end of the sentence: "In the generation, this transformation relies on equations based on Faulon’s rules."

Thanks for the correction; the references are updated.

Regarding the sentence "Compared to MOLGEN, OMG generates large molecules almost 2000 times slower than can be achieved with MOLGEN", was a specific benchmark carried out? Also, the comparison between MOLGEN, OMG and PMG is covered again in the method section; perhaps once is enough.

We shared the benchmark on a scientific blog (https://mayphd.blogspot.com/), which is also cited in the article.

Mathematical Basis Section

The explanation of Fig 3 has been changed as you recommended.

Method Sections

As you recommended, the history and methods sections have been changed, and the methods have been extended in light of your advice. Recent studies, such as Prof. Funatsu's generator and deep-learning-based methods, are now referred to in the article.

Conclusion Section

The conclusion section has been changed.

Response to Jean-Marc Nuzillard

As a general remark, breaking long text blocks into paragraphs facilitates reading.

All the sections are updated.

The abstract refers to drug design and to organic synthesis but the main text does not. See J. Mol. Struct. (Theochem) 165 (1988), 229-242 (or https://en.wikipedia.org/wiki/List_of_computer-assisted_organic_synthesis_software) for organic synthesis.

These topics are mentioned as application fields of these generators; however, their implementation is going to be addressed in a separate CASE review article. The software link will be added to our CASE article.

The limits of the sections entitled "History", "Structure assembly", and "Structure reduction" are not well-defined. Historical considerations are present everywhere, while the History section contains information about Methods and Structure reduction. Please refer to COCON (academic, J Cheminform 3 (2011), 27) and CMC-se (Bruker TopSpin, commercial software) as available structure generators, even though only little detail is known about their algorithms. Also, StrucEluc is a commercial product under the name Structure Elucidator, available from ACD/Labs.

We changed these two sections in line with your and Prof. Faulon's suggestions. COCON and StrucEluc are referred to in the structure assembly subsection.

A table stating which software is available from where for which purpose and under which conditions would be helpful to the reader.

The available software and their links are listed in the External Links section.

Instead of referring to multigraphs, would it not be better to refer to vertex- and edge-labelled graphs (https://en.wikipedia.org/wiki/Graph_labeling)?

Multigraphs are replaced by vertex and edge-labelled graphs and the Wikipedia page is cited.

It can be useful for a structure generator to produce a non-connected graph if the solution to a structure elucidation problem is a mixture (Phytochem. Anal. 1992, 3, 153-159).

Thanks for this article. The mixtures are mentioned in the chemical graphs section.

Graph G is defined either as G(E,V) or G(V,E).

This was a typo; it has been corrected.

Permutation composition is defined by (ab)(x) = a(b(x)) but this definition is not consistent with the composition operator written *. (a*b)(x) = a(b(x)) should be preferred. Is X a set of permutations or a set of integers?

What is the difference between a(u, v) and a((u, v))?

The mathematical definitions were changed as you suggested.

Since atom-edge canonical labelling is mentioned, a reference to InChI (J. Cheminf. 7 (2015), 23) and/or to ALATIS (Sci. Data 4 (2017), 170073) would be useful.

The articles are cited.

Even though chemical structure generation has a strong mathematical basis, applications to molecules of biological interest (or at least of biological origin) should be referenced.

The applications are going to be explained in our separate CASE article.

The conclusion says that "The algorithms are generally breadth-first searches" but LSD is "depth-first" (Chin. J. Chem. 21 (2003), 1263-1267).

That sentence is changed.

The rare recent bibliographic references may leave the reader with the feeling that the topic of structure generation is somewhat outdated.

Recent studies, such as Prof. Funatsu's method articles, are referred to in the methods section.

Wikification

In order to prepare the article for posting to Wikipedia, I have gone through it and adapted it to Wikipedia policies and customs as best I could. In doing so, I tried not to change the content, but sometimes, linking, referencing or rewording required making certain assumptions about what was meant, so some of my changes may have introduced something the authors had not intended to state, link or cite. For these reasons, please check both the edit summaries in the page history as well as the entire change set.

There are a few things remaining to be considered in order to align the text with the English Wikipedia's style:

- The images are all in PNG format, which is what PLOS wants. Wikipedia wants SVG, so as to facilitate reuse (e.g. translation into other languages), so please upload the images to Wikimedia Commons as SVGs and under a CC BY 4.0 license. In case of problems, ping me.
- The table under "External links" is odd in several ways:
  - in contrast to Table 1, it is not numbered
  - the External links section should generally not contain tables
  - such tables are common in list articles, though; perhaps adapt the table to a format similar to the one in List of chemical process simulators?
  - it is not clear what the in- and exclusion criteria were for the table. For example, SMOG is mentioned in the text but not included in the table, while MOLSIG and SENECA appear in the table but are not mentioned in the text.
- As far as contributions of individuals to the field are concerned, these should in general be limited to those that have a Wikipedia entry. I think that a sensible compromise is to spell out the names of each of these named individuals (which I have done) and then link these names (after posting to Wikipedia) to their existing Wikidata entry by way of the Template:Interlanguage link. I can do that after posting, so no need for action at this point.
- Relatedly, the text keeps referring to specific packages as "his", when most of the relevant papers/software had co-authors. I softened this a bit but it merits another look.
- For references, we encourage the use of templates like Cite pmid that fetch all the relevant metadata. Where I found a PMID, I formatted the reference accordingly. In other cases, other templates like Cite doi should be used, but that is not relevant for the submission to PLOS and can thus be worked out later, upon posting to Wikipedia.

--Daniel Mietchen (talk) 10:35, 15 September 2020 (PDT)