Showing posts with label vlife. Show all posts
Showing posts with label vlife. Show all posts

Saturday, July 12, 2014

A quick and dirty algorithm for Maximum Common Substructure

While there is vast amount of literature dedicated to algorithms for Maximum Common Substructure (MCS) search, particularly with respect to molecular graph (see for instance: http://scholar.google.co.in/scholar?hl=en&q=maximum+common+substructure&btnG=), I wanted to quickly write an algorithm without reading any literature on the subject. And thus this algorithm was written, typically over the weekend.

First consider finding an MCS between two molecular graphs:

1. Convert the molecule into a bit representation. This is akin to generating fingerprint of the molecule, but in this case a fingerprint is generated for each Atom in the following way:

  • Each Atom is represented as a bit-set of 64 bits, that will be able to capture a maximum of seven bonds that are connected to the atom under consideration.
  • Each set of 9-bits represents a connection: 7 bits for storing atomic number of connected atom type, 2 bits for bond type (0-single, 1-double, 2-triple, 3-aromatic).  
  • The order in which the bits are stored in the bit-set is always in sorted order: first sorted on atomic number and then on the bond type.
2. Match the corresponding bit atom representations (generated above)  among the two molecular graphs, and build a pair list. This matching is done in two stage: the first stage involves exact bit pattern match, where as the second stage involves partial match. When doing the partial match, the atom pair match that has the maximum bit matches is taken. The pair list thus generated becomes the seed for 'growing' the graph that would hopefully be the maximum common substructure.

3. For each of the pair list generated in step 2, start growing the graph using a breadth first traversal method, at each stage checking if the corresponding expansion in the paired molecule is also possible. The pair stops growing, when no such expansion is possible. And the algorithm stops when there is no growth for any of the pairs in the list. The pair with the maximum number of elements then forms the maximum common substructure. Use this pair to next generate the template molecular sub-graph representing MCS.  

For finding MCS among a set of molecules: 

4. Find the molecule with least number of atoms from the set, and mark it as reference molecule. Follow steps 1 to 3 for the pair of reference molecule and every other molecule in the set. Instead of just picking the pair with maximum elements in step 3, save the complete list (L). Next, pick the pair list that has least number of elements as reference (Lp). For a member in this list (iLp), search the complete list (L), to see if every list has the member iLp, if yes add it to a new list (Lnew). Repeat this procedure for every member of list Lp. Finally, the list with maximum number of elements in Lnew is the MCS for the set of molecules. 

That is it. I have tested the algorithm on different sets of congeneric series of molecules, and it pretty much works for all. For non-congeneric set however, one needs to build a cluster algorithm to find the MCS in sub group of molecules, that is work for some other weekend. 

For now, an implementation of this is available as a scripting function within VLifeMDS (www.vlifesciences.com, visit https://portal.vlifesciences.com/ to buy a license). I will also hopefully make an implementation available with MeTA Studio soon.

  

Tuesday, August 10, 2010

Moving on..

Long time since I posted on my blog! Mostly because I have been pretty busy with preparations to move on.. Yes will be taking up a new position at http://www.vlifesciences.com/ from next Monday.

So just wanted to update on a few things which have kept me engaged:

- Moving is not exactly pleasant, especially if you had not anticipated it and you have too many stuff to get rid of!

- Keeping track of NotionInk Adam, but lost a bit of touch with them due to the above!

- Trying to finish up my current work.

- Getting my thesis published as a monograph :-)


- Best of all : Roaming around on Oz land :) Hope to post in some photos..

Finally, what happens to my existing opensource contributions?

- In one line: there should be no difference, I would continue to contribute and maintain the codes during my free time. All the current licenses will be maintained.

- For MeTAStudio : I would continue to update and add new features to this project. There are also a number other contributors now, whom I guess will continue their contribution as usual.

- For codes developed at ANU:
A part of my work involved writing codes for performing Hartree-Fock calculation on molecular systems using X10 programming language being developed at IBM. These codes were jointly developed by Josh Milthorpe and myself. The portions of the code written by me are copyright of ANU but are available to public under the EPL. For more information visit: http://cs.anu.edu.au/~Josh.Milthorpe/x10.html. These codes are also linked from http://x10.codehaus.org/For+Researchers.

More updates later :)