The Sperrylite Dataset
The Sperrylite Dataset is a complete collection of high-quality structures of protein-bound ligand conformations extracted from the PDB. It comprises 10,936 structures of 4,548 unique ligands and hence offers a unique resource for the study of protein-bound ligand conformations.
The Sperrylite Dataset was compiled with a recently published cheminformatics pipeline that automatically (i) prepares the chemical structures of small molecules, taking the protein environment into account (in order to determine, e.g., the most likely tautomeric and protonation states); (ii) removes undesirable molecules such as crystallization aids, as well as structures with topological and/or geometrical errors; and (iii) rejects structures of low quality. Importantly, the procedure not only checks the resolution and the diffraction-component precision index (DPI), but also employs the recently developed EDIA (electron density score for individual atoms) method to assess how well the individual atoms of a structure are supported by the electron density.
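The quality filter in step (iii) can be pictured with a minimal Python sketch. It is not the published pipeline: the class names, the field names, and the use of the lowest per-atom EDIA value as the acceptance criterion are all assumptions made for illustration, and the thresholds are deliberately left as required inputs because the text above does not state them.

```python
from dataclasses import dataclass


@dataclass
class LigandStructure:
    """Hypothetical record for one protein-bound ligand structure."""
    ligand_id: str
    resolution: float  # crystallographic resolution in Å (lower is better)
    dpi: float         # diffraction-component precision index in Å (lower is better)
    min_edia: float    # lowest per-atom EDIA value of the ligand (higher is better)


@dataclass
class QualityThresholds:
    """Hypothetical cutoffs; no defaults, since the source gives no values."""
    max_resolution: float
    max_dpi: float
    min_edia: float


def passes_quality(s: LigandStructure, t: QualityThresholds) -> bool:
    """Accept a structure only if it satisfies all three criteria."""
    return (
        s.resolution <= t.max_resolution
        and s.dpi <= t.max_dpi
        and s.min_edia >= t.min_edia
    )


def filter_structures(structures, thresholds):
    """Return the structures that pass every quality check."""
    return [s for s in structures if passes_quality(s, thresholds)]


if __name__ == "__main__":
    # Placeholder numbers chosen only to exercise the code.
    cutoffs = QualityThresholds(max_resolution=2.0, max_dpi=0.5, min_edia=0.8)
    candidates = [
        LigandStructure("LIG_A", resolution=1.6, dpi=0.2, min_edia=0.9),
        LigandStructure("LIG_B", resolution=1.6, dpi=0.2, min_edia=0.4),
    ]
    for kept in filter_structures(candidates, cutoffs):
        print(kept.ligand_id)
```

In this sketch a single poorly supported atom is enough to reject a ligand, which mirrors the idea of judging the support of individual atoms rather than relying on global indicators such as resolution alone.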
The Sperrylite Dataset contains (among others) a total of 91 ligands that are each represented by at least ten high-quality structures of their protein-bound conformations. We recently published an analysis of the conformational diversity of these ligands. Of these 91 molecules, 69 had at least two distinct conformations (defined by an RMSD greater than 1 Å). For a representative subset of 17 approved drugs and cofactors, we observed a clear trend toward the formation of a few clusters of highly similar conformers. Even when bound to proteins that share very low sequence identity, ligands were regularly found to adopt similar conformations. For cofactors, a clear trend toward extended conformations was measured, although in a few cases coiled conformers were also observed.
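The criterion of "at least two distinct conformations" can be illustrated with a short Python sketch. It assumes that the conformers of a ligand are given as lists of (x, y, z) coordinates in Å with atoms in the same order, and that they have already been superposed; the published analysis may have handled alignment and molecular symmetry differently. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def rmsd(conf_a, conf_b):
    """Root-mean-square deviation in Å between two conformers.

    Both conformers are sequences of (x, y, z) tuples with matching
    atom order; no superposition is performed here.
    """
    if len(conf_a) != len(conf_b) or not conf_a:
        raise ValueError("conformers must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(conf_a, conf_b)
    )
    return sqrt(squared / len(conf_a))


def has_distinct_conformations(conformers, threshold=1.0):
    """True if any pair of conformers differs by an RMSD above the threshold (Å)."""
    return any(rmsd(a, b) > threshold for a, b in combinations(conformers, 2))


if __name__ == "__main__":
    # Toy two-atom "ligand": the third conformer is shifted by 2 Å along z.
    c1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    c2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    c3 = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
    print(has_distinct_conformations([c1, c2]))
    print(has_distinct_conformations([c1, c2, c3]))
```

The pairwise check grows quadratically with the number of conformers, which is unproblematic for ligands represented by a few dozen structures.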
The Sperrylite Dataset has been published in Frontiers in Chemistry. The full dataset can be downloaded here, whereas the subset of 91 ligands represented by at least ten high-quality conformations is available for download here.