
The Chemistry Development Kit (CDK) is a widely used open-source toolkit for cheminformatics, providing a Java library for chemical structure representation, manipulation, and analysis. Its building blocks also fit modern workflows such as machine learning feature generation, high-throughput virtual screening, and FAIR (Findable, Accessible, Interoperable, Reusable) data handling.
This guide covers installation, core features, practical workflows, and implementation tips tailored for 2026’s computational chemistry landscape.
The CDK is published to Maven Central, so installing it amounts to declaring a dependency in your build.
Add the following to your pom.xml:
<dependency>
  <groupId>org.openscience.cdk</groupId>
  <artifactId>cdk-bundle</artifactId>
  <version>2.10.0</version> <!-- Pin the release you have tested; check Maven Central for newer ones -->
</dependency>
For modular access (e.g., only the core classes and 2D layout):
<dependency>
  <groupId>org.openscience.cdk</groupId>
  <artifactId>cdk-core</artifactId>
  <version>2.10.0</version>
</dependency>
<dependency>
  <groupId>org.openscience.cdk</groupId>
  <artifactId>cdk-sdg</artifactId> <!-- Structure diagram generation (2D coordinates) -->
  <version>2.10.0</version>
</dependency>
Use JShell for rapid prototyping:
jshell --class-path "cdk-bundle-2.10.0.jar"
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;
// SmilesParser needs a builder to create atoms and bonds
SmilesParser sp = new SmilesParser(SilentChemObjectBuilder.getInstance());
IAtomContainer mol = sp.parseSmiles("CCO"); // Ethanol
System.out.println("Atoms: " + mol.getAtomCount());
The CDK models chemistry using a graph-based approach.
| Interface | Purpose | Example |
|---|---|---|
| IAtom | Represents an atom (element, charge, isotope) | new Atom("C") |
| IBond | Represents a bond (single, double, aromatic) | new Bond(atom1, atom2, IBond.Order.SINGLE) |
| IAtomContainer | Container for atoms and bonds (a molecule) | new AtomContainer() |
IAtomContainer ethanol = new AtomContainer();
ethanol.addAtom(new Atom("C")); // C1
ethanol.addAtom(new Atom("C")); // C2
ethanol.addAtom(new Atom("O")); // O
ethanol.addBond(0, 1, IBond.Order.SINGLE); // C1-C2
ethanol.addBond(1, 2, IBond.Order.SINGLE); // C2-O
Note: Use CDKAtomTypeMatcher to assign correct atom types (e.g., sp3 carbon).
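To make the graph view concrete, here is a minimal plain-Java sketch of a molecule as labeled nodes and edges. It does not use CDK classes: `MoleculeGraphSketch`, its methods, and the small valence table are assumptions made for illustration, and real atom typing handles far more cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of a molecule as a labeled graph (not CDK's classes). */
class MoleculeGraphSketch {
    // Assumed typical valences, for this sketch only.
    static final Map<String, Integer> TYPICAL_VALENCE = Map.of("C", 4, "N", 3, "O", 2);

    final List<String> symbols = new ArrayList<>(); // nodes: element symbols
    final List<int[]> bonds = new ArrayList<>();    // edges: {begin, end, order}

    /** Adds an atom and returns its index. */
    int addAtom(String symbol) {
        symbols.add(symbol);
        return symbols.size() - 1;
    }

    void addBond(int begin, int end, int order) {
        bonds.add(new int[] {begin, end, order});
    }

    /** Hydrogens needed to fill the typical valence of the atom at this index. */
    int implicitHydrogens(int atom) {
        int used = 0;
        for (int[] b : bonds) {
            if (b[0] == atom || b[1] == atom) {
                used += b[2];
            }
        }
        return Math.max(0, TYPICAL_VALENCE.get(symbols.get(atom)) - used);
    }

    /** Builds the heavy-atom graph of ethanol: C-C-O. */
    static MoleculeGraphSketch ethanol() {
        MoleculeGraphSketch m = new MoleculeGraphSketch();
        int c1 = m.addAtom("C");
        int c2 = m.addAtom("C");
        int o = m.addAtom("O");
        m.addBond(c1, c2, 1);
        m.addBond(c2, o, 1);
        return m;
    }

    public static void main(String[] args) {
        MoleculeGraphSketch m = ethanol();
        for (int i = 0; i < m.symbols.size(); i++) {
            System.out.println(m.symbols.get(i) + " needs " + m.implicitHydrogens(i) + " H");
        }
    }
}
```

Storing bonds as index pairs mirrors the index-based addBond calls shown earlier, and it shows why atom typing matters: hydrogen counts follow from the valence model, not from the graph alone.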
The CDK supports 15+ chemical formats including SMILES, MOL, SDF, and InChI.
SmilesParser smilesParser = new SmilesParser(SilentChemObjectBuilder.getInstance());
IAtomContainer mol = smilesParser.parseSmiles("c1ccccc1"); // Benzene
// IteratingSDFReader streams one record at a time from an SD file
try (InputStream in = new FileInputStream("molecules.sdf");
     IteratingSDFReader reader = new IteratingSDFReader(in, SilentChemObjectBuilder.getInstance())) {
    while (reader.hasNext()) {
        IAtomContainer record = reader.next();
        System.out.println("Read molecule with " + record.getAtomCount() + " atoms");
    }
}
// To SMILES (Absolute flavor: canonical, with stereochemistry and isotopes)
SmilesGenerator sg = new SmilesGenerator(SmiFlavor.Absolute);
String smiles = sg.create(mol);
// To InChI (requires the cdk-inchi module)
InChIGeneratorFactory factory = InChIGeneratorFactory.getInstance();
InChIGenerator gen = factory.getInChIGenerator(mol);
String inchi = gen.getInchi();
Pro Tip: Standard InChI encodes absolute stereochemistry by default. To change that behavior, pass an InChI options string as the second argument of getInChIGenerator.
StructureDiagramGenerator sdg = new StructureDiagramGenerator();
sdg.setMolecule(mol);
sdg.generateCoordinates();
IAtomContainer mol2d = sdg.getMolecule();
Best Practice: Generate 2D coordinates before depicting a molecule or writing it to a format that expects a layout, such as a molfile.
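// 3D coordinates come from ModelBuilder3D (cdk-builder3d module), not from StructureDiagramGenerator
ModelBuilder3D builder3d = ModelBuilder3D.getInstance(SilentChemObjectBuilder.getInstance());
// The second argument asks for a clone, leaving the input molecule untouched
IAtomContainer mol3d = builder3d.generate3DCoordinates(mol, true);
Note: ModelBuilder3D produces a rule- and template-based geometry and works best on molecules with perceived atom types and explicit hydrogens. For accurate conformers, refine the result with a dedicated force-field or conformer tool.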
Fingerprints are essential for similarity and diversity analysis.
| Fingerprint | Purpose | Size (bits) |
|---|---|---|
| PubchemFingerprinter | PubChem substructure keys | 881 |
| ExtendedFingerprinter | Path-based, with added ring features | 1024 |
| MACCSFingerprinter | MACCS structural keys | 166 |
| CircularFingerprinter | ECFP/FCFP-like circular | variable |
// Generate fingerprints (getBitFingerprint returns an IBitFingerprint, not a BitSet)
PubchemFingerprinter fp = new PubchemFingerprinter(SilentChemObjectBuilder.getInstance());
IBitFingerprint fp1 = fp.getBitFingerprint(mol1);
IBitFingerprint fp2 = fp.getBitFingerprint(mol2);
// Tanimoto similarity
double tanimoto = Tanimoto.calculate(fp1, fp2);
System.out.println("Tanimoto similarity: " + tanimoto);
Tip: For large datasets, compute each fingerprint once and cache it. Comparing precomputed bit sets is cheap; regenerating fingerprints per comparison is not.
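The Tanimoto coefficient itself is simple: the number of bits set in both fingerprints divided by the number set in either. The following plain-Java sketch shows that arithmetic on `java.util.BitSet`. `TanimotoSketch` is a hypothetical name, and returning 0.0 for two empty fingerprints is a convention chosen here, not necessarily what CDK does.

```java
import java.util.BitSet;

/** Plain-Java illustration of the Tanimoto coefficient on bit sets. */
class TanimotoSketch {
    /** |A AND B| / |A OR B|; returns 0.0 when both sets are empty (assumed convention). */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b); // bits set in both fingerprints
        BitSet either = (BitSet) a.clone();
        either.or(b);  // bits set in at least one fingerprint
        int union = either.cardinality();
        return union == 0 ? 0.0 : (double) common.cardinality() / union;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(0);
        a.set(1);
        a.set(2);
        BitSet b = new BitSet();
        b.set(1);
        b.set(2);
        b.set(3);
        // Two shared bits out of four distinct bits
        System.out.println(tanimoto(a, b));
    }
}
```

Cloning before `and`/`or` keeps the inputs unmodified, which matters when fingerprints are cached and reused across many comparisons.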
// Define query: benzene ring, written as an aromatic SMARTS pattern
Pattern benzene = SmartsPattern.create("c1ccccc1");
// Search in molecule
boolean matches = benzene.matches(mol);
System.out.println("Contains benzene? " + matches);
// Enumerate every match, one per unique set of matched atoms
for (IAtomContainer match : benzene.matchAll(mol).uniqueAtoms().toSubstructures()) {
    System.out.println("Match found: " + match.getAtomCount() + " atoms");
}
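Under the hood, substructure search is a subgraph matching problem: each query atom must map to a distinct target atom with the same label, and every query bond must exist between the mapped atoms. The sketch below shows a naive backtracking version in plain Java. `SubgraphMatchSketch` and its helpers are hypothetical names, bond orders and aromaticity are ignored, and production matchers are far more optimized.

```java
/** Naive backtracking subgraph matcher on labeled graphs (illustrative only). */
class SubgraphMatchSketch {
    /** Builds a symmetric adjacency matrix from {begin, end} index pairs. */
    static boolean[][] adjacency(int atomCount, int[][] bonds) {
        boolean[][] adj = new boolean[atomCount][atomCount];
        for (int[] b : bonds) {
            adj[b[0]][b[1]] = true;
            adj[b[1]][b[0]] = true;
        }
        return adj;
    }

    /** True if the query graph can be embedded in the target graph. */
    static boolean contains(String[] qLabels, boolean[][] qAdj,
                            String[] tLabels, boolean[][] tAdj) {
        return extend(0, new int[qLabels.length], new boolean[tLabels.length],
                qLabels, qAdj, tLabels, tAdj);
    }

    private static boolean extend(int q, int[] map, boolean[] used,
                                  String[] qLabels, boolean[][] qAdj,
                                  String[] tLabels, boolean[][] tAdj) {
        if (q == qLabels.length) {
            return true; // every query atom has been mapped
        }
        for (int t = 0; t < tLabels.length; t++) {
            if (used[t] || !qLabels[q].equals(tLabels[t])) {
                continue;
            }
            // Every bond from q to an already-mapped query atom must exist in the target
            boolean consistent = true;
            for (int prev = 0; prev < q && consistent; prev++) {
                if (qAdj[q][prev] && !tAdj[t][map[prev]]) {
                    consistent = false;
                }
            }
            if (!consistent) {
                continue;
            }
            map[q] = t;
            used[t] = true;
            if (extend(q + 1, map, used, qLabels, qAdj, tLabels, tAdj)) {
                return true;
            }
            used[t] = false; // backtrack and try the next candidate
        }
        return false;
    }
}
```

For example, a two-atom C-O query matches ethanol (C-C-O) only after the matcher backtracks from the first carbon, which has no oxygen neighbor, to the second.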
Stereochemistry is critical in drug discovery and synthesis planning.
SmilesParser sp = new SmilesParser(SilentChemObjectBuilder.getInstance());
IAtomContainer mol = sp.parseSmiles("C[C@@H](O)CC"); // (R)-2-butanol
// List the stereo elements perceived from the SMILES
for (IStereoElement element : mol.stereoElements()) {
    System.out.println("Stereo element: " + element.getClass().getSimpleName());
}
// Generate 2D with stereochemistry
StructureDiagramGenerator sdg = new StructureDiagramGenerator();
sdg.setMolecule(mol);
sdg.generateCoordinates();
IAtomContainer mol2d = sdg.getMolecule();
// Export to InChI with stereochemistry
InChIGeneratorFactory factory = InChIGeneratorFactory.getInstance();
InChIGenerator gen = factory.getInChIGenerator(mol);
System.out.println("InChI: " + gen.getInchi());
Note: Tetrahedral centers are stored as ITetrahedralChirality entries in mol.stereoElements(); call getStereo() on an entry to read its winding.
CDK descriptors and fingerprints can be exported as numeric feature vectors and fed to ML frameworks.
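// Calculate molecular descriptors
DescriptorEngine engine = new DescriptorEngine(IMolecularDescriptor.class, SilentChemObjectBuilder.getInstance());
List<Double> values = new ArrayList<>();
for (IDescriptor descriptor : engine.getDescriptorInstances()) {
    IDescriptorResult result = ((IMolecularDescriptor) descriptor).calculate(mol).getValue();
    // Keep scalar results; array-valued descriptors need their own flattening
    if (result instanceof DoubleResult) values.add(((DoubleResult) result).doubleValue());
    else if (result instanceof IntegerResult) values.add((double) ((IntegerResult) result).intValue());
}
// Flatten to a primitive array for export to an ML framework
double[] features = values.stream().mapToDouble(Double::doubleValue).toArray();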
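Most ML runtimes expect 32-bit floats, and descriptor calculations can yield NaN when they fail for a given molecule. Since Java streams have no `mapToFloat`, an explicit loop is the simplest conversion. The sketch below is plain Java; `FeatureVectorSketch` and the choice of a single fill value are assumptions for illustration, and a real pipeline may prefer imputation or dropping the column.

```java
import java.util.Arrays;
import java.util.List;

/** Converts boxed descriptor values to a float[] suitable for ML input (illustrative). */
class FeatureVectorSketch {
    /** Copies values into a float array, replacing null, NaN, and infinite entries with fill. */
    static float[] toFeatures(List<Double> values, float fill) {
        float[] out = new float[values.size()];
        for (int i = 0; i < out.length; i++) {
            Double v = values.get(i);
            boolean unusable = v == null || v.isNaN() || v.isInfinite();
            out[i] = unusable ? fill : v.floatValue();
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(46.07, Double.NaN, 1.0);
        System.out.println(Arrays.toString(toFeatures(values, 0f)));
    }
}
```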
The CDK does not train models itself, but a model trained on CDK-derived features in any framework can be exported to ONNX and served from Python:
# Python: load an ONNX model trained on CDK descriptors
import onnxruntime as ort
import numpy as np
sess = ort.InferenceSession("cdk_model.onnx")
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name
# Placeholder input: replace with real descriptor rows matching the model's input width
features = np.random.rand(1, 512).astype(np.float32)
pred = sess.run([output_name], {input_name: features})[0]
Tip: Use the org.openscience.cdk.graph package (for example, GraphUtil.toAdjList(mol)) to derive graph-based features (nodes = atoms, edges = bonds) for GNNs.
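Many GNN libraries take the graph as an "edge index": two parallel arrays of source and target node indices, with each undirected bond listed in both directions. The plain-Java sketch below builds that layout from bond index pairs. `EdgeIndexSketch` is a hypothetical name, and the exact layout a given framework expects is an assumption to verify against its documentation.

```java
import java.util.Arrays;

/** Builds a 2 x (2 * bondCount) edge index from undirected bonds (illustrative). */
class EdgeIndexSketch {
    /** bonds holds {begin, end} atom indices; row 0 is sources, row 1 is targets. */
    static int[][] edgeIndex(int[][] bonds) {
        int[][] index = new int[2][bonds.length * 2];
        for (int i = 0; i < bonds.length; i++) {
            // Forward direction
            index[0][2 * i] = bonds[i][0];
            index[1][2 * i] = bonds[i][1];
            // Reverse direction, so messages flow both ways along the bond
            index[0][2 * i + 1] = bonds[i][1];
            index[1][2 * i + 1] = bonds[i][0];
        }
        return index;
    }

    public static void main(String[] args) {
        int[][] ethanolBonds = {{0, 1}, {1, 2}}; // C-C and C-O
        System.out.println(Arrays.deepToString(edgeIndex(ethanolBonds)));
    }
}
```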
Automate large-scale chemical analysis with CDK pipelines.
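RuleOfFiveDescriptor ruleOfFive = new RuleOfFiveDescriptor();
for (IAtomContainer mol : moleculeSet.atomContainers()) {
    // The descriptor reports the number of rule-of-five violations
    int violations = ((IntegerResult) ruleOfFive.calculate(mol).getValue()).intValue();
    if (violations == 0) {
        System.out.println("PASSED: " + smilesGenerator.create(mol));
    }
}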
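The rule of five itself is four threshold checks: molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. The plain-Java sketch below counts violations from precomputed property values. `RuleOfFiveSketch` is a hypothetical name, and how the properties are computed (and how many violations to tolerate) is left to the caller.

```java
/** Counts Lipinski rule-of-five violations from precomputed properties (illustrative). */
class RuleOfFiveSketch {
    static int violations(double molWeight, double logP, int hDonors, int hAcceptors) {
        int count = 0;
        if (molWeight > 500.0) count++; // molecular weight above 500 Da
        if (logP > 5.0) count++;        // too lipophilic
        if (hDonors > 5) count++;       // too many hydrogen-bond donors
        if (hAcceptors > 10) count++;   // too many hydrogen-bond acceptors
        return count;
    }

    /** Convenience check with a caller-chosen tolerance. */
    static boolean passes(double molWeight, double logP, int hDonors, int hAcceptors,
                          int maxViolations) {
        return violations(molWeight, logP, hDonors, hAcceptors) <= maxViolations;
    }

    public static void main(String[] args) {
        // Approximate aspirin-like values, used here only as sample input
        System.out.println(violations(180.16, 1.2, 1, 4));
    }
}
```

Returning a count instead of a boolean lets a screening pipeline choose between a strict filter (zero violations) and the common "at most one violation" reading.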
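// Parallel filtering with a ForkJoinPool (loadFromSDF stands for your own loader)
List<IAtomContainer> molecules = loadFromSDF("large.sdf");
ForkJoinPool pool = new ForkJoinPool();
pool.submit(() ->
    molecules.parallelStream()
        // A new descriptor per call, so no instance is shared between threads
        .filter(mol -> ((IntegerResult) new RuleOfFiveDescriptor().calculate(mol).getValue()).intValue() == 0)
        .forEach(mol -> {
            try {
                System.out.println(smilesGenerator.create(mol));
            } catch (CDKException e) { System.err.println("SMILES generation failed: " + e.getMessage()); }
        })
).get();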
Performance Note: For large datasets, stream records with IteratingSDFReader instead of loading the whole file into memory first.
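Running a parallel stream inside a dedicated pool is a common way to cap the number of worker threads. The plain-Java sketch below wraps that pattern so the pool is always shut down. `ParallelFilterSketch` is a hypothetical name, and the fact that a parallel stream runs in the pool that submitted it is long-standing JDK behavior rather than a documented guarantee.

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Filters a list with a parallel stream inside a bounded pool (illustrative). */
class ParallelFilterSketch {
    static <T> List<T> filterInPool(List<T> items, Predicate<T> keep, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // join() rethrows failures as unchecked exceptions
            return pool.submit(() ->
                    items.parallelStream().filter(keep).collect(Collectors.toList())
            ).join();
        } finally {
            pool.shutdown(); // release worker threads even if the task failed
        }
    }

    public static void main(String[] args) {
        List<String> smiles = List.of("CCO", "C", "CCCC");
        // Sample predicate: keep strings longer than one character
        System.out.println(filterInPool(smiles, s -> s.length() > 1, 2));
    }
}
```

Collecting to a list rather than printing inside `forEach` keeps the encounter order and avoids interleaved output from worker threads.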
While CDK is powerful, interoperability is key.
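// CDK to a portable string: SMILES is read by RDKit, Open Babel, and most other toolkits
String smiles = new SmilesGenerator(SmiFlavor.Absolute).create(mol);
// Send to RDKit (Python) via REST or ZMQ
// Or write an SD file from the CDK and convert it with Open Babel on the command line:
obabel -isdf input.sdf -omol2 -O output.mol2
Tip: Exchange structures through standard formats (SMILES, SDF, InChI). Both toolkits read and write them, so no CDK-specific converter is needed.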
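- Use AtomContainerSet for groups of molecules, not individual containers in loops.
- Use IteratingSDFReader for large SDF files.
- Depend only on the modules you need (e.g., leave out cdk-sdg if you never generate coordinates).
- Create molecules through a shared IChemObjectBuilder such as SilentChemObjectBuilder.getInstance().

try (IteratingSDFReader reader = new IteratingSDFReader(
        new FileInputStream("huge.sdf"),
        SilentChemObjectBuilder.getInstance())) {
    while (reader.hasNext()) {
        IAtomContainer mol = reader.next(); // one record in memory at a time
        // Process, or collect into batches of e.g. 1000
    }
}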
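Batch processing on top of a streaming reader can be sketched with a plain `Iterator`, which is the shape a record-by-record reader exposes through `hasNext()` and `next()`. `BatchingSketch` and `forEachBatch` are hypothetical names; the batch size is a tuning parameter, not a CDK setting.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Groups items from any iterator into fixed-size batches (illustrative). */
class BatchingSketch {
    /** Calls handler once per batch and returns the number of batches handled. */
    static <T> int forEachBatch(Iterator<T> source, int batchSize, Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                handler.accept(batch);
                batch = new ArrayList<>(batchSize); // fresh list, so handlers may keep the old one
                batches++;
            }
        }
        if (!batch.isEmpty()) {
            handler.accept(batch); // final, possibly shorter batch
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        Iterator<String> records = List.of("a", "b", "c", "d", "e").iterator();
        forEachBatch(records, 2, b -> System.out.println("Batch of " + b.size()));
    }
}
```

Only one batch is held in memory at a time, so peak memory depends on the batch size rather than the file size.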
| Issue | Cause | Solution |
|---|---|---|
| NullPointerException in coordinates | Missing atom types | Run AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(mol) |
| SMILES parsing fails | Invalid SMILES | Catch InvalidSmilesException and skip or log the record, or validate first |
| Slow performance | Large molecule sets | Use parallel streams or off-heap storage |
| Stereo mismatch in InChI | Incorrect parity assignment | Inspect mol.stereoElements() to diagnose |
| Out of memory | Too many molecules in memory | Use streaming or database-backed storage |
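For the "validate first" advice, a cheap structural pre-check can reject obviously broken SMILES before they reach the parser. The plain-Java sketch below only checks that parentheses, square brackets, and ring-closure labels are balanced. `SmilesPrecheckSketch` is a hypothetical name, and passing this check does not mean the string is valid SMILES; the parser remains the authority.

```java
/** Cheap balance check for SMILES strings; not a full validator (illustrative). */
class SmilesPrecheckSketch {
    static boolean looksBalanced(String smiles) {
        if (smiles == null || smiles.isEmpty()) {
            return false;
        }
        int parens = 0;
        boolean inBracket = false;
        boolean[] openRings = new boolean[100]; // ring labels 0-9 and %10-%99
        for (int i = 0; i < smiles.length(); i++) {
            char c = smiles.charAt(i);
            if (inBracket) {
                // Digits inside [...] are isotopes, H counts, or charges, not ring labels
                if (c == '[') return false;
                if (c == ']') inBracket = false;
                continue;
            }
            if (c == '[') {
                inBracket = true;
            } else if (c == ']') {
                return false;
            } else if (c == '(') {
                parens++;
            } else if (c == ')') {
                if (--parens < 0) return false;
            } else if (c == '%') {
                // Two-digit ring label such as %12
                if (i + 2 >= smiles.length() || !isDigit(smiles.charAt(i + 1))
                        || !isDigit(smiles.charAt(i + 2))) {
                    return false;
                }
                int label = (smiles.charAt(i + 1) - '0') * 10 + (smiles.charAt(i + 2) - '0');
                openRings[label] = !openRings[label];
                i += 2;
            } else if (isDigit(c)) {
                openRings[c - '0'] = !openRings[c - '0']; // open or close the ring label
            }
        }
        for (boolean open : openRings) {
            if (open) return false; // a ring label was opened but never closed
        }
        return parens == 0 && !inBracket;
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

A pre-check like this is mainly useful for logging and triage in bulk loads, where it can separate truncated records from chemically questionable ones.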
The CDK continues to evolve with:
- … cdk-quantum).
- … cdk-cuda.

Final Tip: Always pin your CDK version in production to avoid breaking changes. Use an exact release number (e.g., 2.10.0) for reproducibility.
The Chemistry Development Kit in 2026 stands as a mature, extensible, and interoperable platform for computational chemistry. From basic molecule manipulation to machine learning integration, the CDK delivers the tools needed for modern cheminformatics workflows.
By leveraging modular design, efficient data structures, and seamless integration with other tools, developers can build scalable, reproducible, and FAIR-compliant chemistry pipelines. Whether you're filtering virtual libraries, training GNNs, or generating 3D conformers, the CDK provides a solid foundation—empowering innovation at the intersection of chemistry and data science.
Start with the cdk-bundle, explore the examples, and let the CDK accelerate your research in 2026 and beyond.