Developing a Bioinformatics Database for Disulfide Bonds Research

The Protein Data Bank (PDB) bioinformatics database is the world’s largest repository of experimentally-determined structures of proteins, nucleic acids, and complex assemblies. All data is gathered using experimental methods such as X-ray, spectroscopy, crystallography, NMR, etc. This article explains how to extract, filter, and clean data from the PDB to make it suitable for further analysis.

Toptalauthors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.

The Protein Data Bank (PDB) bioinformatics database is the world’s largest repository of experimentally-determined structures of proteins, nucleic acids, and complex assemblies. All data is gathered using experimental methods such as X-ray, spectroscopy, crystallography, NMR, etc. This article explains how to extract, filter, and clean data from the PDB to make it suitable for further analysis.

Toptalauthors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.
Viktor Bojović

With a diverse background in science and teaching and his software engineer experience, Victor is a passionate programmer with many talents.


The Protein Data Bank (PDB) bioinformatics database is the world’s largest repository of experimentally-determined structures of proteins, nucleic acids, and complex assemblies. All data is gathered using experimental methods such as X-ray, spectroscopy, crystallography, NMR, etc.

This article explains how to extract, filter, and clean data from the PDB. This, in turn, enables the type of analysis explained in the article Occurrence of protein disulfide bonds in different domains of life: a comparison of proteins from the Protein Data Bank, published in Protein Engineering, Design and Selection, Volume 27, Issue 3, 1 March 2014, pp. 65–72.

The PDB has a lot of repeating structures with different resolutions, methods, mutations, etc. Doing an experiment with the same or similar proteins can produce bias in any group analysis, so we will need to choose the correct structure from among any set of duplicates. For that purpose, we need to use a non-redundant (NR) set of proteins.

For the purpose of normalization, I recommend downloading the chemical compound dictionary for importing atom names into a database that uses 3NF or uses a star schema and dimensional modeling. (I’ve also used DSSP to help eliminate problematic structures. I won’t cover that in this article, but note that I didn’t use any other DSSP features.)

Data used in this research contains single-unit proteins who contain at least one disulfide bond taken from different species. To perform an analysis, all disulfide bonds are first classified as consecutive or nonconsecutive, by domain (archaea, prokaryote, viral, eukaryote, or other), and also by length.

Primary and tertiary protein structures

Primary and tertiary protein structures, before and after protein folding.
Source: Protein Engineering, Design and Selection, as mentioned at the beginning of this article.


To be ready for input into R, SPSS, or some other tool, an analyst will need the data to be in a database table with this structure:

codecharacter(4)Experiment ID (alphanumeric, case-insensitive, and cannot start with a zero)
titlecharacter varying(1000)Title of the experiment, for reference (field can be also text format)
ss_bondsintegerNumber of disulfide bonds in the chosen chain
ssbonds_overlapintegerNumber of overlapping disulfide bonds
intra_countintegerNumber of bonds made within the same chain
sci_name_srccharacter varying(5000)Scientific name of organism from which the sequence is taken
tax_pathcharacter varyingPath in Linnaean classification tree
src_classcharacter varying(20)Top-level class of organism (eukaryote, prokaryote, virus, archaea, other)
has_reactives7booleanTrue if and only if the sequence contains reactive centers
len_class7integerLength of sequence in set 7 (set with p-value 10e-7 calculated by blast)

Materials and Methods

In order to accomplish this goal, the first step is to gather data from That site contains downloadable PDB structures of experiments in various formats.

Although data is stored in multiple formats, in this example, only the formatted fixed-space delimited textual format (PDB) will be used. An alternative to the PDB textual format is its XML version, PDBML, but it sometimes contains malformed atom position naming entries, which can cause problems for data analysis. The older mmCIF and other formats may also be available, but they will not be explained in this article.

The PDB Format

The PDB format is a fragmented fixed-width textual format which can easily be parsed by SQL queries, Java plugins, or Perl modules, for example. Each data type in the file container is represented as a line beginning with the appropriate tag—we’ll go through each tag type in the following subsections. The line length is less than or equal to 80 characters, where a tag takes six or less characters plus one or more spaces which together take eight bytes. There are also cases without spaces between tags and data, usually for CONECT tags.


The TITLE tag marks a line as being (part of) the title of the experiment, containing the molecule name and other relevant data like the insertion, mutation, or deletion of a specific amino acid.

TITLE    3 STRUCTURES                                                           

In the case where there are multiple lines to a TITLE record, then the title has to be concatenated, ordering by a continuation number, which is placed, right-aligned, on bytes 9 and 10 (depending on the number of these lines).


The data stored in ATOM lines is coordinate data for each atom in an experiment. Sometimes an experiment has insertions, mutations, alternate locations, or multiple models. This results in repeating the same atom multiple times. Choosing the right atoms will be explained later.

ATOM    390  N   GLY A  26      -1.120  -2.842   4.624  1.00  0.00           N  
ATOM    391  CA  GLY A  26      -0.334  -2.509   3.469  1.00  0.00           C  
ATOM    392  C   GLY A  26       0.682  -1.548   3.972  1.00  0.00           C  
ATOM    393  O   GLY A  26       0.420  -0.786   4.898  1.00  0.00           O  
ATOM    394  H   GLY A  26      -0.832  -2.438   5.489  1.00  0.00           H  
ATOM    395  HA2 GLY A  26       0.163  -3.399   3.111  1.00  0.00           H  
ATOM    396  HA3 GLY A  26      -0.955  -2.006   2.739  1.00  0.00           H  

The example above is taken from the experiment 1BAH. The first column marks the type of record, and the second column is the serial number of the atom. Every atom in a structure has its own serial number.

Next to the serial number there is the atom position label, which takes four bytes. From that atom position, it is possible to extract the chemical symbol of the element, which is not always present in the record data in its own separate column.

After the atom name there is a three-letter residue code. In the case of proteins, that residue corresponds to an amino acid. Next, the chain is coded with one letter. By chain we mean a single chain of amino acids, with or without gaps, although sometimes ligands can be assigned to a chain; this case is detectable through very large gaps in an amino acid sequence, which is in the next column. Sometimes the same structure can be scanned with mutations included, in which case the insertion code is available in an extra column after the sequence column. The insertion code contains a letter to help to distinguish which residue is affected.

The next three columns are the spatial coordinates of each atom, measured in Angstroms (Å). Next to these coordinates is the occupancy column, which says what the probability is for the atom to be in that place, on the usual scale of zero to one.

The second-last column is the temperature factor, which carries information about the disorder in the crystal, measured in Ų. A value greater than 60Ų signifies disorder, while one lower than 30Ų signifies confidence. It is not always present in PDB files because it depends on the experimental method.

The next columns—symbol and charge—are usually missing. The chemical symbol can be gathered from the atom position column, as we mentioned above. When the charge is present, it’s suffixed to the symbol as an integer followed by + or -, e.g. N1+.


This marks the end of the chain. Even without this line, it is easy to distinguish where a chain ends. Thus, often the TER line is missing.


A MODEL line marks where the model of a structure starts, and it contains the serial number of the model.

After all atomic lines in that model, it ends with an ENDMDL line.


These lines contain disulphide bonds between cysteine amino acids. Disulfide bonds can be present in other residue types, but in this article only amino acids will be analyzed, so only cysteine is included. The following example is from the experiment with code 132L:

SSBOND   1 CYS A    6    CYS A  127                          1555   1555  2.01
SSBOND   2 CYS A   30    CYS A  115                          1555   1555  2.05
SSBOND   3 CYS A   64    CYS A   80                          1555   1555  2.02
SSBOND   4 CYS A   76    CYS A   94                          1555   1555  2.02

In this example there are four disulfide bonds tagged in the file with their sequence number in the second column. All of these bonds use cysteine (columns 3 and 6), and all of them are present in chain A (columns 4 and 7). After each chain there is a residue sequence number indicating the bond’s position in the peptide chain. Insertion codes are next to each residue sequence, but in this example they aren’t present because there was no amino acid inserted in that region. The two columns before the last one are reserved for symmetry operations, and the last column is the distance between sulfur atoms, measured in Å.

Let’s take a moment to give some context to this data.

The pictures below, taken using the NGL viewer, show the structure of experiment 132L. In particular, they show a protein without ligands. The first picture uses a stick representation, with CPK coloring showing sulfurs and their bonds in yellow. V-shaped sulfur connections represent methionine connections, while the Z-shaped connections are disulfide bonds between cysteines.

Stick representation with CPK coloring showing sulfur disulfide bonds in yellow

In the next picture, a simplified method of protein visualization called backbone visualization is shown colored by amino acids, where cysteines are yellow. It represents the same protein, with its sidechain excluded, and only part of its peptide group included—in this case, the protein backbone. It’s made of three atoms: N-terminal, C-alpha, and C-terminal. This picture doesn’t show disulfide bonds, but it’s simplified to show the protein’s spatial arrangement:

Simplified protein backbone colored by amino acids where cysteines are yellow

Pipes are created by joining peptide-bonded atoms with a C-alpha atom. Cysteine’s color is same as the color of sulfur in the CPK coloring method. When cystenes come close enough, their sulfurs create disulfide bonds, and that strengthens the structure. Otherwise this protein would bind too much, and its structure would be less stable at higher temperatures.


These records are used for tagging connections between atoms. Sometimes these tags are not present at all, whereas other times all the data is filled in. In the case of analyzing disulfide bonds, this part can be useful—but it isn’t necessary. That’s because, in this project, non-tagged bonds are added by measuring distances, so this would be overhead, and also has to be checked.


These records contain information about the source organism from which the molecule has been extracted. They contain subrecords for easier location in taxonomy, and have the same multi-line structure we saw with title records.

SOURCE    MOL_ID: 1;                                  
SOURCE   4 ORGANISM_TAXID: 7165;                      
SOURCE   5 GENE: GST1-6;                              
SOURCE   7 EXPRESSION_SYSTEM_TAXID: 562;              

NR Format

This is a list of non-redundant (NR) chain PDB sets. Its snapshots can be found at Its purpose is to avoid unnecessary biases caused by protein similarity. NR has three sets with different identity p-value levels created by comparison of all PDB structures. The result is added to textual files which will be explained later. Not all columns are needed for this project, so only the important ones will be explained.

The first two columns contain the unique PDB experiment code and the chain identifier as explained for ATOM records above. Columns 6, 9, and C contain information about p-value representativity, which is the level of similarity of sequences calculated by BLAST. If that value is zero, then it is not accepted to be part of a set; if the value is 1, then it is. The mentioned columns represent the acceptance of sets with p-values cutoffs of 10e-7, 10e-40, and 10e-80, respectively. Only sets with a p-value cutoff of 10e-7 will be used for analysis.

The last column contains info about a structure’s acceptability, where a is acceptable and n is not.

# 1  2       3     4    5 6     7    8 9     A    B C     D    E F    G      H      I      J     K     L   M   N     O P Q
3F8V A   69715     1    1 1     1    1 1     1    1 1  9427    1 1   0.00   0.00   0.00   0.00  1.08   1   6   5   164 X a
3DKE X   68132     1    2 0     1    2 0     1    2 0 39139    1 1   0.00   0.00   0.00   0.00  1.25   1  11   7   164 X a
3HH3 A   77317     1    3 0     1    3 0     1    3 0    90    1 0   0.00   0.00   0.00   0.00  1.25   1   5   4   164 X a
3HH5 A   77319     1    4 0     1    4 0     1    4 0    90    2 0   0.00   0.00   0.00   0.00  1.25   1   4   4   164 X a

Database Construction and Parsing Data

Now that we have an idea of what we’re dealing with and what we need to do, let’s get started.

Downloading Data

All data for this analysis can be found at these three addresses:

The first two links contain a list of archives. The latest archive from each should be used to avoid problems stemming from a lack of resolution or quality. The third link contains the newest taxonomy archive directly.

Parsing Data

Usually the parsing of PDB files is done by plugins or modules in Java, Perl, or Python. In the case of this research, I wrote a custom Perl application without using a pre-written PDB-parsing module. The reason for that is when parsing a large quantity of data, in my experience, the most common problem with using experimental data is errors in the data. Sometimes there are errors with coordinates, distances, line lengths, comments in places where they shouldn’t be, etc.

The most effective way to deal with this is to initially store everything in the database as raw text. Common parsers are written to deal with ideal data that conforms completely to specifications. But in practice, data is not ideal, and that will be explained in filtering section where you’ll find the Perl import script.

Database Construction

When constructing the database, note that this database is built for processing data. Later analysis will be done in SPSS or R. For our purposes here it is recommended to use PostgreSQL with at least version 8.4.

The table structure is directly copied from the downloaded files with only a few small changes. In this case, the number of records is far too small for it to be worth spending our time on normalization. As mentioned, this database is single-use only: These tables aren’t built to be served on a website—they are just there for processing data. Once that is finished, they can be dropped, or backed up as supplementary data, perhaps for repeating the process by some other researcher.

In this case, the final result will be one table which can then be exported to a file for use with some statistical tool like SPSS or R.


Data extraction from ATOM records has to be connected to HEADER or TITLE records. The data hierarchy is explained in the picture below.

The natural hierarchy of disulfide bond data that would result in a database in third normal form

Since this picture is a simplified representation of a database in the third normal form (3NF), for our purposes it contains too much overhead. The reason: To calculate the distance between atoms for disulfide bond detection, we would need to do joins. In this case, we would have a table joined to itself twice, and also joined to a secondary and primary structure twice each, which is a very slow process. Since not every analysis needs secondary structure information, another schema is proposed in case you need to reuse data or analyze bigger quantities of disulfide bonds:

A more detailed model of a database schema that does not use secondary structures (3NF)

Disulfide bonds are not so frequent as other covalent bonds are, so a warehouse model is not needed, although it could be used. The star schema and dimensional modeling below will take too much time to develop, and will make queries more complex:

The database layout using the star schema and dimensional modeling

In cases where all bonds have to be processed, then I recommend the star schema.

(Otherwise it’s not needed, because disulfide bonds aren’t as common as other bonds are. In the case of this work, the number of disulfide bonds is near 30,000, which may be fast enough in 3NF, but I decided to process it via a non-normalized table, so it isn’t pictured here.)

The expected total number of all covalent bonds is at least twice the number of atoms in the tertiary structure, and in that case 3NF would be very slow, so denormalization using the star schema form is needed. In that schema, some tables have two foreign key checks, and that is because a bond is created between two atoms, so each atom needs to have its own primary_structure_id, atom_name_id and residue_id.

There are two ways to fill the d_atom_name dimension table: from data, and from another source, the chemical component dictionary I mentioned earlier. Its format is similar to the PDB format: Only RESIDUE and CONECT lines are useful. This is because RESIDUE’s first column contains a residue three-letter code, and CONECT contains the name of the atom and its connections, which are also atom names. So from this file, we can parse all atom names and include them in our database, although I recommend you allow for the possibility of the database containing unlisted atom names.

RESIDUE   PRO     17
CONECT      N      3 CA   CD   H   
CONECT      CA     4 N    C    CB   HA  
CONECT      C      3 CA   O    OXT 
CONECT      O      1 C   
CONECT      CB     4 CA   CG   HB2  HB3 
CONECT      CG     4 CB   CD   HG2  HG3 
CONECT      CD     4 N    CG   HD2  HD3 
CONECT      OXT    2 C    HXT 
CONECT      H      1 N   
CONECT      HA     1 CA  
CONECT      HB2    1 CB  
CONECT      HB3    1 CB  
CONECT      HG2    1 CG  
CONECT      HG3    1 CG  
CONECT      HD2    1 CD  
CONECT      HD3    1 CD  
CONECT      HXT    1 OXT 
HET    PRO             17
FORMUL      PRO    C5 H9 N1 O2

In this project, speed of coding is more relevant than speed of execution and storage consumption. I decided not to normalize—after all, our goal is to generate a table with the columns mentioned in the intro.

In this part, only the most important tables will be explained.

The main tables are:

  • proteins: Table with experiment names and codes.
  • ps: Primary structure table which will contain sequence, chain_id, and code.
  • ts: Table containing tertiary/quaternary structure extracted from raw data and transformed into ATOM record format. This will be used as a staging table, and can be dropped after extraction. Ligands are excluded.
  • sources: The list of organisms from which experimental data was derived.
  • tax_names, taxonomy_path, taxonomy_paths: Linnean taxonomy names from the NCBI taxonomy database, used to get taxonomy paths from organisms listed in sources.
  • nr: List of NCBI non-redundant proteins extracted from the NR set.
  • pdb_ssbond: List of disulfide bonds in a given PDB file.

Filtering and Processing Data

Data is retrieved from snapshots from the RCSB PDB repository.

Each file is imported into a single table raw_pdb in our PostgreSQL database using a Perl script. The script uses transactions of 10,000 inserts per chunk.

The structure of raw_pdb is this:

codecharacter varying(20)not null
line_numintegernot null
line_contcharacter varying(80)not null

The import script looks like this:

#!/usr/bin/perl -w

use FindBin '$Bin';
use DBI;

$dbName = 'bioinf'; 
$dbLogin = 'ezop'; 
$dbPass = 'XYZ';

$conn = DBI->connect("DBI:Pg:database=$dbName;host=localhost", "$dbLogin", "$dbPass", {'RaiseError' => 1, 'AutoCommit' => 0});

die "./ <table> <file>" if not defined($ARGV[0]);

$recordCount = 0;
$table = $ARGV[0];
$fName = $ARGV[1];
open F, "zcat $fName|";
while (<F>) {
    $linija = $_;
    $recordCount += 1;
    $sql = "insert into $table (code, line_num, line_cont) values (?, ?, ?)";

    $conn->do($sql, undef, $fName, $recordCount, $linija);
    $conn->commit() if ($recordCount%10000 == 0);
close F;


After lines are imported, they are parsed using functions we will define below.

From raw_pdb data, we generate the tables ts, ps, proteins, sources, sources_organela, and ss_bond by parsing the corresponding records.

The ps table has three important columns: chain, length, and sequence. Protein sequence is generated using C-alpha atoms for each chain and for each residue ordered by residue sequence, taking only the first insertion and the first alternate location. chain is taken from the TS.chain column, and length is simply the precalculated length of the sequence string. Since this article is meant to analyze only single chains and intrachain connections, multiple-chain proteins are skipped in our analysis here.

Within SSBOND records, all disulfide bonds are stored in the pdb_ssbond table, which inherits from the pdb_ssbond_extended table. pdb_ssbond looks like this:

idintegernot nullnextval(‘pdb_ssbond_id_seq’::regclass) 
codecharacter(4)  four-letter code
ser_numinteger  serial number of ssbond
residue1character(3)  first residue in bond
chain_id1character(1)  first chain in bond
res_seq1integer  sequential number of first residue
i_code1character(1)  insertion code of first residue
residue2character(3)  second residue in bond
chain_id2character(1)  second chain in bond
res_seq2integer  sequential number of second residue
i_code2character(1)  insertion code of second residue
sym1character(6)  first symmetry operator
sym2character(6)  second symmetry operator
distnumeric(5,2)  distance between atoms
is_reactivebooleannot nullfalsemark for reactivity (to be set)
is_consecutivebooleannot nulltruemark for consecutive bonds (to be set)
rep7booleannot nullfalsemark for set-7 structures (to be set)
rep40booleannot nullfalsemark for set-40 structures (to be set)
rep80booleannot nullfalsemark for set-80 structures (to be set)
is_from_pdbboolean trueis taken from PDB as source (to be set)

I also added these indexes:

"pdb_ssbond_pkey" PRIMARY KEY, btree (id)
"ndxcode1" btree (code, chain_id1, res_seq1)
"ndxcode2" btree (code, chain_id2, res_seq2)

It is assumed that distribution of disulfide bonds prior to the cutoff is Gaussian (without testing with, e.g., KS-test), so standard deviations were calculated on each distance between cysteines in same protein to discover the range of permitted bond lengths and compare them to the cutoff. The cutoff was same as the calculated mean plus-minus three standard deviations. We have extended the range to introduce more possible disulfide bonds which weren’t enlisted in PDB file in SSBOND rows but which we have inserted ourselves by calculating distances between ATOM records. The chosen range for ssbonds are between 1.6175344456264 and 2.48801947642267 Å, which is the mean (2.05) plus-minus four standard deviations:

select                     count(1) as     cnt
    ,      stddev(dist)             as std_dev  
    ,                     avg(dist) as avg_val
    ,     -stddev(dist) + avg(dist) as   left1
    ,      stddev(dist) + avg(dist) as  right1
    , -2 * stddev(dist) + avg(dist) as   left2
    ,  2 * stddev(dist) + avg(dist) as  right2
    , -3 * stddev(dist) + avg(dist) as   left3
    ,  3 * stddev(dist) + avg(dist) as  right3
    , -4 * stddev(dist) + avg(dist) as   left4
    ,  4 * stddev(dist) + avg(dist) as  right4
    ,         min(dist)
    ,         max(dist)
from pdb_ssbond_tmp
where dist > 0
    and dist < 20;

The TS table contains the coordinates of all atoms, but only cysteines will be used, with their sulfur named " SG ". So another staging table with " SG " sulfur atoms only is created for speeding up the process by reducing the number of records to search. When selecting sulfurs only, the number of combinations is much less than in the case of all atoms—194,574 of the former compared with 122,761,100 of the latter. Within this table joined to itself, distances are calculated using the Euclidean distance, and results are imported into the pdb_ssbond table but only where the distance is between the defined lengths calculated earlier. The reason for doing this speedup is to lessen the amount of time of running the whole process again—for checking purposes—keeping it within one day.

To clean disulfide bonds, we use the following algorithm:

  • Delete when both sides of connection point to same amino acid
  • Delete bonds whose length is not between 1.6175344456264 and 2.48801947642267
  • Remove insertions
  • Remove bonds caused by alternate atom locations, but leaving first location

The code for this (taking the pdb_ssbond table as the first argument) is:

 CREATE OR REPLACE FUNCTION pdb_ssbond_clean2( 
    clean_icodes boolean,
    clean_altloc boolean,
    mark_reactive boolean,
    mark_consecutive boolean)
RETURNS void AS $$
declare _res integer;
    delete from pdb_ssbond b 
    where exists (
        from pdb_ssbond a
        where a.code = b.code
        and >
        and (
            (   a.chain_id1 = b.chain_id1 and a.res_seq1 = b.res_seq1
            and a.chain_id2 = b.chain_id2 and a.res_seq2 = b.res_seq2)
            (   a.chain_id1 = b.chain_id2 and a.res_seq1 = b.res_seq2
            and a.chain_id2 = b.chain_id1 and a.res_seq2 = b.res_seq1)
    ) ; 

    from pdb_ssbond
    where chain_id1 = chain_id2
    and res_seq1 = res_seq2
    and i_code1 = i_code2;
    update pdb_ssbond 
    set is_consecutive = true
    , is_reactive = false;
    delete from pdb_ssbond 
    where not dist between 1.6175344456264 and 2.48801947642267;
    if clean_icodes then  
        delete from pdb_ssbond 
        where  i_code1 not in ('',  ' ', 'A') 
        or     i_code2 not in ('',  ' ', 'A') ;
    end if;
    if clean_altloc then  
        delete from pdb_ssbond a
        where exists (
            select 1
            from pdb_ssbond  b
            where   b.code = a.code
                and b.chain_id1 = a.chain_id1
                and b.res_seq1 = a.res_seq1
                and b.i_code1 = a.i_code1
                and b.ser_num < a.ser_num
        ) ;
        delete from pdb_ssbond a
        where exists (
            select 1
            from pdb_ssbond  b
            where   b.code = a.code
                and b.chain_id2 = a.chain_id2
                and b.res_seq2 = a.res_seq2
                and b.i_code2 = a.i_code2
                and b.ser_num < a.ser_num
        ) ;
    end if;

    if mark_reactive then  
            update pdb_ssbond 
            set is_reactive = false ; 
            update pdb_ssbond 
            set is_reactive = true
            where exists (
                select 1
                from pdb_ssbond  b
                where   b.code = pdb_ssbond.code
                    and b.chain_id1 = pdb_ssbond.chain_id1
                    and b.res_seq1 = pdb_ssbond.res_seq1
                    and b.i_code1 = pdb_ssbond.i_code1
                    and b.ser_num < pdb_ssbond.ser_num
            ) ;  
            update pdb_ssbond 
            set is_reactive = true         
            where exists (
                select 1
                from pdb_ssbond  b
                where   b.code = pdb_ssbond.code
                    and b.chain_id2 = pdb_ssbond.chain_id2
                    and b.res_seq2 = pdb_ssbond.res_seq2
                    and b.i_code2 = pdb_ssbond.i_code2
                    and b.ser_num < pdb_ssbond.ser_num
            ) ; 
            update pdb_ssbond 
            set is_reactive = true         
            where (code, chain_id1, res_seq1, i_code1)
            in (
                select code, chain_id1 as c, res_seq1 as r, i_code1 as i
                from pdb_ssbond
                select code, chain_id2 as c, res_seq2 as r, i_code2 as i
                from pdb_ssbond
            ) ; 
            update pdb_ssbond 
            set is_reactive = true         
            where (code, chain_id2, res_seq2, i_code2)
            in (
                select code, chain_id1 as c, res_seq1 as r, i_code1 as i
                from pdb_ssbond
                select code, chain_id2 as c, res_seq2 as r, i_code2 as i
                from pdb_ssbond
    end if;
    if mark_consecutive then
           update pdb_ssbond
           set is_consecutive = false;

            update pdb_ssbond
            set is_consecutive = true
            where not exists (
                select 1
                from pdb_ssbond a
                    a.code = pdb_ssbond.code
                    and (
                        (a.chain_id1 = pdb_ssbond.chain_id1 and a.chain_id2 = pdb_ssbond.chain_id2)
                        (a.chain_id1 = pdb_ssbond.chain_id2 and a.chain_id2 = pdb_ssbond.chain_id1)
                    and (
                           (a.res_seq1 between pdb_ssbond.res_seq1 and pdb_ssbond.res_seq2)
                        or (a.res_seq2 between pdb_ssbond.res_seq1 and pdb_ssbond.res_seq2)
                        or (pdb_ssbond.res_seq1 between a.res_seq1 and a.res_seq2)
                        or (pdb_ssbond.res_seq2 between a.res_seq1 and a.res_seq2)
                    and not (
                            a.code = pdb_ssbond.code
                        and a.chain_id1 = pdb_ssbond.chain_id1
                        and a.chain_id2 = pdb_ssbond.chain_id2 
                        and a.res_seq1 = pdb_ssbond.res_seq1
                        and a.res_seq2 = pdb_ssbond.res_seq2
    end if; 
$$ LANGUAGE plpgsql;

With this, the non-redundant set of proteins is imported to the nr table which is joined to the ps and proteins tables, and sets are marked (set7, set40, and set80). At the end, according to protein quantity only one set will be analyzed. Mismatched chains between PDB and NR are removed from analysis.

Proteins without disulfide bonds are excluded from research, together with proteins that don’t belong to any set. Data is processed with DSSP, and these files which had problems with resolution or too many atoms to be processed are also excluded. Only proteins with single chains are used as result for analysis because interchain connections were not kept, although they are easily calculated from the ssbond table by counting the number of connections that have different chains.

For the remaining proteins, an update is done for the total number of bonds and the number of overlapping bonds, and this is done for each of the sets.

The source organism is extracted from SOURCE records. In cases where it is unknown, synthetic, designed, artificial, or hybrid, it is discarded from research. Low-resolution proteins are also excluded only when their side chain is not visible.

SOURCE records are stored in the sources table, which contains taxonomy rows. In some cases, the taxonomy is missing or incorrect. In these cases, the manual correction of experts is needed.

From the source and taxonomy downloaded from NCBI, the class is assigned to each primary structure. In case a class is unable to be assigned, the protein is removed from the analysis list. Knowing that biological databases are being used, an extra check of all source records and NCBI taxonomy classes is recommended to be performed by a biologist, otherwise there might be problems with classifications between different domains. To match source cellular locations with taxonomy IDs, data from the source table is extracted into the table sources_organela where all data is written as code, tag, and value. Its format is shown below:

select * from sources_organela where code = '1rav';

The taxonomy archive we want to use is a ZIP file containing seven dump files. Among these files, two of the most important are names.dmp and merged.dmp. Both files are CSV tab-pipe delimited files as detailed in the documentation:

  • The file merged.dmp contains a list of previous taxonomy IDs, and the current taxonomy IDs into which each one was merged.
  • names.dmp contains these important columns in this order:
    • tax_id: The taxonomy ID
    • name_txt: The name of the species, and if applicable, the unique name (where species can be found with multiple names, this helps disambiguate)
  • division.dmp contains the names of top-level domains which we will use as our classes.
  • nodes.dmp is the table which contains a hierarchical structure of organisms using taxonomy IDs.
    • It contains a parent taxonomy ID, a foreign key which can be found in names.dmp.
    • It also contains a division ID which is important for us to correctly store the relevant top domain data.

With all this data and manual corrections (setting correct domains of life) we were able to create the structure of the taxonomy_path table. A sampling of data looks like this:

select * from taxonomy_path order by length(path)  limit 20 offset 2000;
142182cellular organisms;Bacteria;Gemmatimonadetesfffft
136087cellular organisms;Eukaryota;Malawimonadidaeftfff
649454Viruses;unclassified phages;Cyanophage G1168tffff
321302Viruses;unclassified viruses;Tellina virus 1tffff
649453Viruses;unclassified phages;Cyanophage G1158tffff
536461Viruses;unclassified phages;Cyanophage S-SM1tffff
536462Viruses;unclassified phages;Cyanophage S-SM2tffff
77041Viruses;unclassified viruses;Stealth virus 4tffff
77042Viruses;unclassified viruses;Stealth virus 5tffff
641835Viruses;unclassified phages;Vibrio phage 512tffff
1074427Viruses;unclassified viruses;Mouse Rosavirustffff
1074428Viruses;unclassified viruses;Mouse Mosavirustffff
480920other sequences;plasmids;IncP-1 plasmid 6-S1ffftf
2441other sequences;plasmids;Plasmid ColBM-Cl139ffftf
168317other sequences;plasmids;IncQ plasmid pIE723ffftf
536472Viruses;unclassified phages;Cyanophage Syn10tffff
536474Viruses;unclassified phages;Cyanophage Syn30tffff
2407other sequences;transposons;Transposon Tn501ffftf
227307Viruses;ssDNA viruses;Circoviridae;Gyrovirustffff
687247Viruses;unclassified phages;Cyanophage ZQS-7tffff

Before any analysis, to avoid biases, sequences have to be checked for their level of identity. Although the NR set contains sequences which are already compared between each other, an extra check is always recommended.

For each disulfide bond’s prior statistical analysis, data is marked if it is reactive or overlapping. After marking overlaps, that information automatically reveals how many consecutive and non-consecutive bonds are inside each protein, and that data is stored in the proteins table from which all protein complexes are excluded in final result.

Each disulfide bond is marked also for its association to sets, by checking both bond sides to see if they belong to the same NR set. Where that is not the case, the disulfide bond is skipped for that set analysis.

To analyze the quantity of bonds by their variance, each length has to be put in a specific class. In this case, only five classes are chosen as written in the function below.

CREATE OR REPLACE FUNCTION len2class(len integer)
    RETURNS integer AS
        when len <= 100                then 1
        when len >  100 and len <= 200 then 2
        when len >  200 and len <= 300 then 3
        when len >  300 and len <= 400 then 4
                                       else 5
    COST 100;

Most of the protein sizes are less than 400 amino acids, so length classification is done by splitting lengths into five ranges, each taking 100 amino acids, except the last one which takes the rest. Below you can see how to use this function to extract data for analysis:

 SELECT p.code,
    ( SELECT
           FROM src_classes s
          WHERE s.src_class::text = p.src_class::text) AS src_class_id,
    ( SELECT s.val
           FROM sources_organela s
          WHERE s.code = p.code::bpchar AND s.tag::text = 'EXPRESSION_SYSTEM_CELLULAR_LOCATION'::text) AS expression_system_cellular_location,
    ( SELECT s.val
           FROM sources_organela s
          WHERE s.code = p.code::bpchar AND s.tag::text = 'CELLULAR_LOCATION'::text) AS cellular_location,
   FROM proteins p
     JOIN nr n ON n.code::text = p.code::text AND n.rep7 = true
     JOIN ps ps ON ps.code::text = n.code::text AND ps.chain_id = n.chain_id AND ps.het = false
  WHERE p.is_excluded = false AND p.chain_cnt = 1 AND p.is_set7 = true AND p.reactive_cnt7 = 0
  ORDER BY p.code;

An example result with corrected titles and some manually added columns is shown below.

PDB codeTotal number of SS-bondsNumber of non-consecutive SS-bondsPDB length / amino acidsDomainTargetP 1.1TatP 1.0SignalP 4.1ChloroP 1.1TMHMM 2.0 number of transmembrane domainsBig-pinucPredNetNES 1.1PSORTb v3.0SecretomeP 2.0LocTree2Consensus localization prediction
1akp20114BacteriaNDTat-signalno signal peptideND0NDNDNDunknownNDfimbriumunknown
1bhu20102BacteriaNDTat-signalsignal peptideND1NDNDNDunknownNDsecretedunknown
1c750071BacteriaNDTat-signalno signal peptideND0NDNDNDcytoplasmic membranenonclassical secretionperiplasmunknown
1c8x00265BacteriaNDTat-signalsignal peptideND1NDNDNDunknownNDsecretedunknown
1cx110153BacteriaNDTat-signalsignal peptideND1NDNDNDextracellularNDsecretedunknown
1dab00539BacteriaNDTat-signalsignal peptideND0NDNDNDouter membraneNDouter membraneunknown
1dfu0094BacteriaNDTat-signalno signal peptideND0NDNDNDcytoplasmicNDcytosol unknown
1e8r2250BacteriaNDTat-signalsignal peptideND0NDNDNDunknownNDsecretedsecreted
1esc30302BacteriaNDTat-signalsignal peptideND1NDNDNDextracellularNDperiplasmunknown
1g6e1087BacteriaNDTat-signalsignal peptideND1NDNDNDunknownNDsecretedunknown

PostgreSQL as a Processing Intermediary

In this work, we showed how to process data, from fetching to analyzing. When working with scientific data, sometimes normalization is needed, and sometimes not. When working with small quantities of data which will not be reused for another analysis, then it is enough to leave it denormalized where processing is fast enough.

The reason why this was done all in one bioinformatics database is that PostgreSQL is able to integrate many languages. This includes R, which can do statistical analysis directly in-database—the subject for a future article on bioinformatics tools.

A special thanks to Toptal colleagues Stefan Fuchs and Aldo Zelen for their invaluable consultations.

Understanding the basics

  • What is a database?

    A database is data that’s electronically organized and stored for easier retrieval.

  • What are the two main types of databases?

    In terms of bioinformatics, amino acid databases and nucleic acid databases are the two main types, but there are also hybrids.

  • What is a disulfide bond in biology?

    A disulfide bond is a bond between two sulfur atoms (in this case, between cysteines). Additionally, disulfide bonds can be formed between two cysteines from different protein chains.

  • What is the role of a disulfide bond?

    In the case of proteins, a disulfide bond stabilizes the structure.

  • Is a disulfide bond polar covalent?

    A disulfide bond is a covalent bond. When both sides of the bond are symmetric (both residues are cysteines) then it is non-polar. In the case where some metal ion is near it, the electron cloud changes and it becomes polar like it would be if it were asymmetric. Polar means it attracts water.

  • Which protein structure has disulfide bonds?

    Beta sheets, beta barrels, and beta strands have disulfide bonds. It is possible to find it in other structures, too.

  • What is the purpose of a database?

    To store and organize data for easier retrieval. In this article, a database helps perform ETL (extraction, transformation, and loading) with a relatively large quantity of data. Data is stored in the database, processed inside the database, and then extracted to be analyzed with SPSS or R.

  • How can bioinformatics be used?

    It’s mostly in medicine and biology, but also in physics and chemistry. It can help better understand the structures and properties of different pharmaceuticals, diseases, and abnormalities. It’s used to explore organism functioning, help to recognize unknown species, and find the closest species to a given one.

  • What is meant by data cleaning?

    Biologists’ databases are constantly curated, but often human and machine errors still happen, e.g., impossibly long disulfide bonds, wrongly classified organisms, or wrong taxonomy numbers. The bigger the database, the more errors can be found. One technique used as a starting point is searching for outliers.

Hire a Toptal expert on this topic.
Hire Now

About the author

With a diverse background in science and teaching and his software engineer experience, Victor is a passionate programmer with many talents.

World-class articles, delivered weekly.

By entering your email, you are agreeing to our privacy policy.

World-class articles, delivered weekly.

By entering your email, you are agreeing to our privacy policy.

Join the Toptal® community.