Bioinformatics, COMP0082 (A7U, A7P)  
Main Summer Alternative Assessment, 2019/20  
There are FOUR questions in total.  
Answer all FOUR questions.  
Marks for each part of each question are indicated in square brackets [n].  
Ideally submit word-processed answers as a PDF, unless you have a relevant SoRA.  
Do not exceed the specified word counts.  
1. Brief bioinformatics paper on subcellular location prediction miniproject (existing  
coursework repurposed). Already submitted.   
[40 marks]  
[Total for Question 1: 40 marks]  
2. The diagram above shows a region near an origin of replication in a human  
chromosome. Using an exonuclease digestion method, which can isolate RNA-primed  
single-stranded DNA, the sequence of an Okazaki fragment has been obtained which  
overlaps with this region. The sequence is as follows:  
5’ – gccgcaccctgggcgaatggctacgtcaacgatcactcggctcaggtacat – 3’  
a) Assuming this region is known to comprise an exon open reading frame of a gene  
located on the top strand, determine the most likely amino acid translation. Use  
the standard genetic code table, show your working and explain the logic you have  
applied to arrive at your final answer.  
[2.5 marks]  
b) The genome of a newly discovered organism Mysteriosus thingi has just been  
sequenced and the resulting sequence data has been analysed. Use the following  
data to calculate the estimated fraction of the M. thingi genome that is coding  
(show your working). Based on your result, is M. thingi more likely to be a  
eukaryote or a prokaryote? Justify your choice.  
Number of genes in the genome = 6120  
Average molecular weight of proteins in Mt’s proteome = 22,000 a.m.u. (atomic  
mass units; 1 a.m.u. = 1/12 of the mass of one carbon-12 atom)  
C-value = 4.5 x 10^6  
[2.5 marks]  
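The arithmetic behind part (b) can be sketched as follows. This is a minimal sketch under two assumptions that the question does not state: an average amino acid residue mass of about 110 a.m.u. (a common rule of thumb) and a C-value expressed in base pairs. The function name coding_fraction is illustrative, and stop codons are ignored.

```python
def coding_fraction(n_genes, avg_protein_mass, genome_size_bp, residue_mass=110.0):
    """Estimate the fraction of a genome that encodes protein.

    residue_mass is an ASSUMED average mass per amino acid residue (a.m.u.);
    genome_size_bp ASSUMES the C-value is given in base pairs.
    """
    residues_per_protein = avg_protein_mass / residue_mass
    coding_bp = n_genes * residues_per_protein * 3  # three nucleotides per codon
    return coding_bp / genome_size_bp


fraction = coding_fraction(n_genes=6120, avg_protein_mass=22_000, genome_size_bp=4.5e6)
print(f"Estimated coding fraction: {fraction:.3f}")
```

Changing residue_mass shifts the estimate proportionally, so the assumed value should be stated alongside the result.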
[Total for Question 2: 5 marks]  
[Diagram not reproduced. Recoverable labels: Top strand, Bottom strand, 5’ and 3’ ends, Labelled Region, Origin.]  
3. a) For a Needleman-Wunsch (NW) global alignment of four sequences (w, x, y, z) of  
lengths Lw, Lx, Ly, Lz, the match score can be represented as S(wi, xj, yk, zl) in the  
case of no gaps, or S(wi, xj, -, zl) in the case of a gap being placed in the third  
sequence, and so on. Using this notation, write out the recurrence formula and  
commented pseudocode for the four-sequence NW global alignment algorithm, where M  
is the dynamic programming matrix, and the cost of inserting a single gap in one  
sequence is given by positive value d. Assume in your answer that the gap penalty  
will be applied for every individual gap position inserted, i.e. the maximum total  
gap penalty that can be accumulated at any position in the dynamic programming  
matrix will be 3d.  
[10 marks]  
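One way to sketch the four-dimensional recurrence described in part (a) is shown below. It is an illustration under stated assumptions, not a model answer: column_score is a hypothetical stand-in for S (sum-of-pairs, +1 match, -1 mismatch, gaps ignored), the default d = 2 is arbitrary, and only the optimal score is returned (no traceback). Each cell takes the maximum over the 15 moves in which at least one sequence consumes a residue, subtracting d for every gap in the column.

```python
from itertools import product


def column_score(column):
    """Hypothetical S(...): sum-of-pairs over non-gap residues (+1 match, -1 mismatch)."""
    residues = [c for c in column if c != "-"]
    score = 0
    for a in range(len(residues)):
        for b in range(a + 1, len(residues)):
            score += 1 if residues[a] == residues[b] else -1
    return score


def nw4_score(w, x, y, z, d=2):
    """Optimal global alignment score of four sequences with per-gap penalty d."""
    seqs = (w, x, y, z)
    lengths = tuple(len(s) for s in seqs)
    # 15 moves: each sequence either consumes a residue (1) or receives a gap (0).
    moves = [m for m in product((0, 1), repeat=4) if any(m)]
    M = {(0, 0, 0, 0): 0}
    # product() yields cells in lexicographic order, so predecessors are filled first.
    for cell in product(*(range(n + 1) for n in lengths)):
        if cell == (0, 0, 0, 0):
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue  # this move would step outside the matrix
            column = tuple(s[c - 1] if m else "-" for s, c, m in zip(seqs, cell, move))
            gaps = 4 - sum(move)  # between 0 and 3, so the penalty is at most 3d
            candidate = M[prev] + column_score(column) - d * gaps
            if best is None or candidate > best:
                best = candidate
        M[cell] = best
    return M[lengths]


print(nw4_score("AC", "AC", "AC", "AC"))
```

The running time grows as Lw x Lx x Ly x Lz x 15, which is why exact dynamic programming is rarely used beyond a handful of short sequences.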
b) An alignment of four viral protein sequences is shown below:  
MIELSLIDFYLC  
MNELTLIDFYLC  
MLHLTLLDFYLL  
MIHLTLFDFYLC  
Calculate a regularized sequence profile, formatted as 20 rows of 12 columns, for the  
above small sequence family using the Laplace rule (pseudocount=1) as needed. Give  
the resulting relative frequencies to 3 d.p. and order the rows in 3-letter amino acid  
code order (Ala, Arg, Asn … Val). Show your working for the first column.  
[5 marks]  
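The Laplace-regularized frequencies asked for in part (b) can be sketched as below. This is a minimal sketch that adds the pseudocount to every residue in every column, giving (count + 1) / (N + 20) for N sequences; whether "as needed" should be read more narrowly is left to the reader. The names AMINO_ACIDS and laplace_profile are illustrative.

```python
# One-letter codes ordered by 3-letter code: Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly,
# His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

ALIGNMENT = ["MIELSLIDFYLC", "MNELTLIDFYLC", "MLHLTLLDFYLL", "MIHLTLFDFYLC"]


def laplace_profile(alignment, pseudocount=1.0):
    """Return {amino acid: [frequency per column]} with add-one style smoothing."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    denominator = n_seqs + pseudocount * len(AMINO_ACIDS)
    profile = {}
    for aa in AMINO_ACIDS:
        row = []
        for col in range(n_cols):
            count = sum(1 for seq in alignment if seq[col] == aa)
            row.append((count + pseudocount) / denominator)
        profile[aa] = row
    return profile


profile = laplace_profile(ALIGNMENT)
for aa in AMINO_ACIDS:
    print(aa, " ".join(f"{value:.3f}" for value in profile[aa]))
```

Each column of the resulting 20 x 12 table sums to 1, which is a quick check on hand calculations.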
c) Again using the same alignment, compute appropriate sequence weights for each  
sequence using average sequence dissimilarity weighting (using simple amino acid  
percentage identity as the similarity metric). Again, show your working.  
[5 marks]  
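A sketch of the weighting in part (c) follows. It assumes one common convention: each sequence's raw weight is its mean dissimilarity (1 minus fractional identity) to the other sequences, and the weights are then normalized to sum to 1. Other normalizations (for example, summing to the number of sequences) are equally defensible; the function names are illustrative.

```python
ALIGNMENT = ["MIELSLIDFYLC", "MNELTLIDFYLC", "MLHLTLLDFYLL", "MIHLTLFDFYLC"]


def identity(a, b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    return sum(1 for p, q in zip(a, b) if p == q) / len(a)


def dissimilarity_weights(alignment):
    """Average-dissimilarity weights, normalized to sum to 1 (ASSUMED convention)."""
    n = len(alignment)
    raw = []
    for i in range(n):
        total = sum(1.0 - identity(alignment[i], alignment[j]) for j in range(n) if j != i)
        raw.append(total / (n - 1))
    norm = sum(raw)
    return [value / norm for value in raw]


weights = dissimilarity_weights(ALIGNMENT)
for seq, weight in zip(ALIGNMENT, weights):
    print(seq, f"{weight:.3f}")
```

Under this convention the most divergent sequence (the third) receives the largest weight.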
d) Using the sequence weights you have calculated, recompute your sequence profile  
to take sequence diversity into account, i.e. by appropriately weighting the raw counts.  
Again, give the weighted frequencies to 3 d.p. and highlight any values which differ  
from the original profile.  
[2.5 marks]  
e) Finally, consider a new potential family member, shown below as the bottom  
sequence (MRKLNIVEYFVS; bold in the original). This match was found by searching  
a large data bank of sequences (UniProt).  
MIELSLIDFYLC  
MNELTLIDFYLC  
MLHLTLLDFYLL  
MIHLTLFDFYLC  
|  | |     
MRKLNIVEYFVS  
Average % Amino Acid Composition in UniProt Data Bank:  
Ala:  8.25    
Arg:  5.53   
Asn:  4.06    
Asp:  5.45    
Cys:  1.37    
Gln:  3.93    
Glu:  6.75    
Gly:  7.07    
His:  2.27    
Ile:  5.96    
Leu:  9.66    
Lys:  5.84    
Met:  2.42    
Phe:  3.86    
Pro:  4.70    
Ser:  6.56    
Thr:  5.34    
Trp:  1.08    
Tyr:  2.92    
Val:  6.87    
Use your previously calculated weighted and regularized sequence profile, and the  
above average amino acid composition table (background frequencies), to compute  
the total LLR (Log Likelihood Ratio) score for the new sequence matched against  
your profile. The individual match scores and total score should be computed in bits  
(log base 2). From your final answer, and the information above, give your judgement  
as to whether this new sequence can confidently be assigned to this family.  
[2.5 marks]  
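The scoring step in part (e) can be sketched as a sum of per-position log-odds, log2(p / q), where p is the profile frequency of the observed residue at that position and q is its background frequency. The sketch below uses a toy two-letter profile purely to show the mechanics; the data structures (a dict of per-column frequency lists, and a dict of background fractions) are assumptions. Note that the composition table above is in percent, so its values would need dividing by 100 first.

```python
import math


def llr_bits(sequence, profile, background):
    """Total log-likelihood ratio in bits for a sequence scored against a profile.

    profile: {residue: [frequency at column 0, column 1, ...]}
    background: {residue: frequency as a fraction (not a percentage)}
    """
    total = 0.0
    for position, residue in enumerate(sequence):
        total += math.log2(profile[residue][position] / background[residue])
    return total


# Toy example (hypothetical numbers, not taken from the question).
toy_profile = {"A": [0.5, 0.25], "C": [0.5, 0.75]}
toy_background = {"A": 0.25, "C": 0.25}
score = llr_bits("AC", toy_profile, toy_background)
print(f"{score:.3f} bits")
```

A positive total means the sequence is more probable under the profile than under the background model; how large it must be to be convincing depends on the size of the data bank searched.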
[Total for Question 3: 25 marks]  
4. Results for a prototype experiment investigating the genes involved in skin cancer are  
given below. Assume this data has been normalized and that it shows log2 expression  
units as measured by an Affymetrix GeneChip machine. Both diseased samples and  
normal samples have 3 biological replicates. Only the first 4 genes are given to  
facilitate calculations.  
         Log Expression for               Log Expression for  
         Melanoma Samples                 Normal Skin Samples  
Genes    Array 1   Array 2   Array 3      Array 4   Array 5   Array 6  
Gene 1   8.8       8.6       8.8          7.4       7.8       7.6  
Gene 2   6.3       6.2       6.3          6.2       6.3       6.3  
Gene 3   8.3       8.2       8.3          8.6       8.8       8.8  
Gene 4   8.4       8.4       8.5          9.5       9.5       9.5  
...      ...       ...       ...          ...       ...       ...  
...      ...       ...       ...          ...       ...       ...  
Do the following tasks using whatever tools you wish (python with matplotlib,  
R, Excel, etc., or just a calculator doing manual calculations). Support your  
results with the code that you wrote if you are using a language/package or  
write down individual steps you carried out if you did it manually.  
(a) Draw an MvA plot for the above data clearly labelling the genes (use all three  
biological replicates to determine the average expression level for a  
particular gene in a particular condition).  
[4 marks]  
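The coordinates for the MvA plot in part (a) can be sketched as below: for each gene, M is the difference of the two condition means and A is their average (both already in log2 units). Taking M as melanoma minus normal is an assumption about the direction of the log ratio. The function name mva_points is illustrative; the resulting points could then be drawn with any plotting tool, for example a scatter plot with A on the x-axis and M on the y-axis.

```python
from statistics import mean

# (melanoma replicates, normal replicates) in log2 units, from the table above.
EXPRESSION = {
    "Gene 1": ([8.8, 8.6, 8.8], [7.4, 7.8, 7.6]),
    "Gene 2": ([6.3, 6.2, 6.3], [6.2, 6.3, 6.3]),
    "Gene 3": ([8.3, 8.2, 8.3], [8.6, 8.8, 8.8]),
    "Gene 4": ([8.4, 8.4, 8.5], [9.5, 9.5, 9.5]),
}


def mva_points(expression):
    """Return {gene: (M, A)} with M = mean(melanoma) - mean(normal) (ASSUMED direction)."""
    points = {}
    for gene, (melanoma, normal) in expression.items():
        mean_melanoma = mean(melanoma)
        mean_normal = mean(normal)
        points[gene] = (mean_melanoma - mean_normal, (mean_melanoma + mean_normal) / 2)
    return points


points = mva_points(EXPRESSION)
for gene, (m_value, a_value) in points.items():
    print(f"{gene}: M = {m_value:+.3f}, A = {a_value:.3f}")
```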
(b) Use Welch’s t-test to determine, for each gene, the p-value for a significant  
difference in gene expression going from normal samples to diseased  
samples.  
[3 marks]  
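The quantities behind Welch’s test in part (b) can be sketched with the standard library alone: the t statistic uses the unpooled standard error, and the degrees of freedom come from the Welch-Satterthwaite approximation. The sketch stops at t and the degrees of freedom; converting them to a two-sided p-value needs the t distribution, for which a statistics package can be used (for instance, SciPy’s ttest_ind with equal_var=False performs Welch’s test directly). The function name welch_t is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # variance() is the sample variance (divides by n - 1).
    term_a = variance(sample_a) / n_a
    term_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(term_a + term_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = (term_a + term_b) ** 2 / (term_a**2 / (n_a - 1) + term_b**2 / (n_b - 1))
    return t_stat, dof


# Gene 1 from the table above: melanoma versus normal replicates.
t_gene1, dof_gene1 = welch_t([8.8, 8.6, 8.8], [7.4, 7.8, 7.6])
print(f"Gene 1: t = {t_gene1:.3f}, degrees of freedom = {dof_gene1:.2f}")
```

The function would fail with a division by zero if both samples had zero variance, which does not occur for the four genes listed.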
(c) Assume a large number of genes are measured in this experiment. Describe  
two reasons why many false predictions of differentially expressed genes  
would result when applying the t-test formula in this particular  
experimental design when using a p-value threshold of 5%. For each  
reason, describe how the t-test calculation can be modified to decrease the  
number of false positives. [Maximum word limit of 200 words.]  
[4 marks]  
(d) Draw a volcano plot for the above experiment clearly showing the different  
genes.  
[4 marks]  
(e) Carry out hierarchical clustering of the samples using the L1 (Manhattan, or  
city-block) distance metric and complete linkage. Provide the distance matrix  
determined, the individual steps that you took during the clustering process  
(similar to the lecture), and the final dendrogram with samples clearly  
labelled and branch lengths annotated.  
[5 marks]  
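The clustering procedure in part (e) can be sketched in plain Python as below: Manhattan distances between the six arrays (each a vector of four gene values), followed by repeated merging of the two clusters whose farthest members are closest (complete linkage). This is an illustrative sketch that only reports the merge order and heights; it does not draw the dendrogram, and the function names are not from any library.

```python
# Each array as a vector of (Gene 1, Gene 2, Gene 3, Gene 4) values from the table above.
ARRAYS = {
    "Array 1": [8.8, 6.3, 8.3, 8.4],
    "Array 2": [8.6, 6.2, 8.2, 8.4],
    "Array 3": [8.8, 6.3, 8.3, 8.5],
    "Array 4": [7.4, 6.2, 8.6, 9.5],
    "Array 5": [7.8, 6.3, 8.8, 9.5],
    "Array 6": [7.6, 6.3, 8.8, 9.5],
}


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def complete_linkage(samples):
    """Agglomerative clustering; returns a list of (cluster_1, cluster_2, height)."""
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                dist = max(
                    manhattan(samples[a], samples[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = complete_linkage(ARRAYS)
for left, right, height in merges:
    print(f"merge {left} + {right} at height {height:.1f}")
```

Library routines such as SciPy’s hierarchical clustering functions (complete linkage with the city-block metric) could be used instead when a drawn dendrogram is required.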
(f) After this prototype experiment, the research lab wants to do a more  
definitive experiment analysing the genes involved in skin cancer. Write a  
report describing a more extensive experiment that would enable them to  
discover different unknown subtypes of skin cancer and also detect genes  
that are differentially expressed in each subtype compared to normal.  
Outline important aspects of the overall experimental methodology  
including issues that may need to be considered and possible solutions. Also  
explain how the resulting data would be analysed to satisfy the study  
objectives, and how the methodology would differ from the calculations  
carried out above. [Maximum word limit of 500 words.]  
[10 marks]  
[Total for Question 4: 30 marks]