Searching for chemical structures is a critical first step in drug discovery, particularly when it comes to identifying lead compounds. Structure-driven searches have also become increasingly important in metabolomics and pharmacology research.
Several freely accessible databases, such as SciWalker and SureChEMBL, extract compounds from patents fully automatically [27]. The commercial databases PatBase and Orbit also support automated chemical extraction from patents.
CAS Registry
The CAS Registry is the world’s largest database of chemical substances and sequences. Each record has a unique CAS Registry Number (CASRN) of up to ten digits, divided by hyphens into three parts: the first part has two to seven digits, the second has two digits, and the third is a single check digit.
Numbers are assigned sequentially, so newer substances receive larger numbers than earlier ones. The format allows for up to one billion unique CAS numbers, and over 15,000 substances are added to the database daily.
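The three-part CASRN format can be checked programmatically. The sketch below assumes the standard check-digit rule: the final digit equals the weighted sum of the preceding digits (the rightmost weighted 1, the next 2, and so on) modulo 10. The function name is illustrative and not part of any CAS tool, and passing the check only shows that a number is well formed, not that it has been assigned to a substance.

```python
import re

# First part: 2-7 digits, second part: 2 digits, third part: 1 check digit.
CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string is a well-formed CASRN with a correct check digit."""
    match = CASRN_PATTERN.match(casrn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)  # all digits except the check digit
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    weighted_sum = sum(
        position * int(digit)
        for position, digit in enumerate(reversed(body), start=1)
    )
    return weighted_sum % 10 == check_digit


if __name__ == "__main__":
    for candidate in ("7732-18-5", "50-00-0", "7732-18-4"):
        print(candidate, is_valid_casrn(candidate))
```

For water (7732-18-5), the weighted sum is 8×1 + 1×2 + 2×3 + 3×4 + 7×5 + 7×6 = 105, and 105 modulo 10 gives the check digit 5.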
On the Chemical Abstracts Online website, entering a CAS Registry Number into the search bar finds the corresponding chemical structure quickly and precisely. The search also returns a list of articles that mention that CAS Registry Number.
PubChem
PubChem boasts an expansive library of chemical structures, many of which resemble drugs. These structures can be utilized for various purposes such as drug discovery, molecular modeling, optimization and de novo drug design.
PubChem is organized into three interlinked databases: Substance, Compound and BioAssay (see Relevant Websites section). Records are identified by a substance ID (SID) in Substance, a compound ID (CID) in Compound and an assay ID (AID) in BioAssay. A CID identifies the unique chemical structure described by one or more depositor-supplied records in the Substance database.
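PubChem's three databases use separate identifier namespaces (SID for Substance, CID for Compound, AID for BioAssay), and this carries over to its PUG REST web service. The sketch below only builds request URLs and sends nothing over the network; the helper name `record_url` and the lookup table are illustrative, and the URL shape follows the domain/namespace/identifier/output pattern of PUG REST.

```python
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Maps each PubChem database to its PUG REST domain and identifier namespace.
ID_NAMESPACES = {
    "Substance": ("substance", "sid"),
    "Compound": ("compound", "cid"),
    "BioAssay": ("assay", "aid"),
}


def record_url(database: str, identifier: int, output: str = "JSON") -> str:
    """Build a PUG REST URL for one record in the given PubChem database."""
    domain, namespace = ID_NAMESPACES[database]
    return f"{PUG_REST_BASE}/{domain}/{namespace}/{identifier}/{output}"


if __name__ == "__main__":
    # CID 2244 is aspirin in the Compound database.
    print(record_url("Compound", 2244))
```

Keeping the database-to-namespace mapping in one table makes it harder to request a SID from the Compound domain by mistake.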
Structures deposited in the Substance database pass through a standardization protocol. This involves verification steps covering valence, element, hydrogen and functional group identities, and structures are modified during this phase so that each molecule is represented accurately. The protocol simplifies the cleanup and normalization of chemical information while keeping provenance clear.
SciFindern
SciFindern is a chemistry-curated database from CAS, offering reference and substance searching with advanced analysis tools. It covers all aspects of chemistry research, including journal citations and abstracts, substance data, chemical reactions, and regulatory information.
SciFindern’s most powerful feature is its chemical structure search. You can either upload a structure drawn using CAS Draw or create one yourself for quick retrieval.
You can search by exact structure, substructure or structural similarity. Furthermore, you can refine results by restricting them to single-component substances.
Alternatively, you can utilize the Substance Roles feature to restrict your results to articles in which a specific substance played an important role (e.g., preparation, uses, analytical study).
Another helpful feature is the option to filter your search based on various criteria, such as commercial availability and molecular weight. Doing this can help reduce the number of results returned.
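To make the idea of a similarity search concrete, the sketch below ranks a small library against a query using the Tanimoto coefficient, a common similarity measure for structural fingerprints. This is a generic illustration and not SciFindern's implementation: the fingerprints are hypothetical sets of feature numbers, and the function names and the 0.7 threshold are assumptions chosen for the example.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared features divided by all features present."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similarity_search(query: frozenset, library: dict, threshold: float = 0.7) -> list:
    """Return (name, score) pairs scoring at least `threshold`, best first."""
    scored = ((name, tanimoto(query, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fingerprints: each integer stands for one structural feature.
    query = frozenset({1, 2, 3, 4})
    library = {
        "A": frozenset({1, 2, 3, 4}),
        "B": frozenset({1, 2, 3, 5}),
        "C": frozenset({1, 2, 3, 4, 5}),
        "D": frozenset({7, 8}),
    }
    print(similarity_search(query, library))
```

A score of 1.0 means the two fingerprints are identical, which does not by itself guarantee identical structures; that distinction is why exact-structure search is offered as a separate mode.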
TOXNET
TOXNET is a collection of databases from the National Library of Medicine that provide information on hazardous chemicals, environmental health hazards, toxic releases, chemical nomenclature, poisoning, risk assessment and regulations, as well as occupational safety and health. It is used extensively by scientists, researchers and the general public.
TOXNET provides comprehensive data about the most prevalent substances and their properties. Additionally, it links to PubMed®, the National Library of Medicine’s free web interface to biomedical literature worldwide, for further sources of toxicological knowledge.
diXa uses the NCI/CADD Chemical Identifier Resolver (CIR) to map between chemical identifiers. It then uses PubChem’s PUG REST service to retrieve molecular formulas, InChIs, IUPAC names and synonyms as well as structure images. Furthermore, diXa maps InChIKeys (3) to CASRNs and searches repositories that use these CASRNs.
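The identifier-mapping chain described above can be sketched as two URL builders, one for the CIR and one for PUG REST. This is not diXa's code: the function names are hypothetical, and the sketch only constructs request URLs without sending them. The `stdinchikey` and `cas` representations and the `MolecularFormula`, `InChI` and `IUPACName` property names are assumed to be the ones the two services expose.

```python
from typing import Iterable
from urllib.parse import quote

CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cir_url(identifier: str, representation: str) -> str:
    """URL asking the CIR to convert an identifier into another representation."""
    return f"{CIR_BASE}/{quote(identifier, safe='')}/{representation}"


def pubchem_property_url(inchikey: str, properties: Iterable[str]) -> str:
    """URL asking PUG REST for selected properties of a compound, keyed by InChIKey."""
    wanted = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/inchikey/{inchikey}/property/{wanted}/JSON"


if __name__ == "__main__":
    aspirin_key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # InChIKey of aspirin
    print(cir_url("acetylsalicylic acid", "stdinchikey"))  # name -> InChIKey
    print(cir_url(aspirin_key, "cas"))                     # InChIKey -> CASRN
    print(pubchem_property_url(aspirin_key, ["MolecularFormula", "InChI", "IUPACName"]))
```

Percent-encoding the identifier matters because chemical names often contain spaces and other characters that are not valid in a URL path.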
Entrez
Entrez (pronounced /ɒnˈtreɪ/) is a search engine and web portal that gives users access to many health sciences databases at the National Center for Biotechnology Information (NCBI). It supports text-based searches of data such as DNA and protein sequences, gene mapping information, 3D structure data, PubMed/MEDLINE articles and taxonomy information.
In addition to a unified query string, the system supports Boolean operators and search term tags that restrict parts of a query to specific fields. This enables efficient retrieval of pertinent information, together with links back to source material, including 3D structures.
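As an illustration of Boolean operators combined with field tags, the sketch below assembles a query and wraps it in a URL for the `esearch` endpoint of NCBI's E-utilities, the programmatic interface to Entrez. The helper names are hypothetical, the search terms are arbitrary examples, and no request is sent.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def tagged(term: str, field: str) -> str:
    """Restrict a search term to one field, e.g. aspirin[Title]."""
    return f"{term}[{field}]"


def esearch_url(db: str, query: str, retmax: int = 20) -> str:
    """Build an E-utilities esearch URL for the given database and query."""
    params = {"db": db, "term": query, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Boolean operators are written in upper case; parentheses group terms.
    query = (
        f"{tagged('aspirin', 'Title')} AND "
        f"({tagged('toxicity', 'Title/Abstract')} OR {tagged('poisoning', 'Title/Abstract')}) "
        f"NOT {tagged('review', 'Publication Type')}"
    )
    print(query)
    print(esearch_url("pubmed", query))
```

Building the query as plain text first keeps it readable and lets the same string be pasted into the Entrez web search box, while `urlencode` handles the brackets, spaces and slashes when the query goes into a URL.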
The Entrez Gene database assigns unique integer identifiers to genes and other loci in model organisms; these identifiers are tracked and linked to the NCBI Taxonomy database. Furthermore, most gene-specific records are linked to their associated chromosome position as well as to related gene products.