
Bioinformatics, COMP0082 (A7U, A7P)
Main Summer Alternative Assessment, 2019/20

There are FOUR questions in total.
Answer all FOUR questions.
Marks for each part of each question are indicated in square brackets [n].
Ideally submit word-processed answers as a PDF, unless you have a relevant SoRA.
Do not exceed the specified word counts.

1. Brief bioinformatics paper on subcellular location prediction miniproject (existing
coursework repurposed). Already submitted.
[40 marks]

[Total for Question 1: 40 marks]

2. The accompanying diagram shows a region near an origin of replication in a human
chromosome. Using an exonuclease digestion method, which can isolate RNA-primed
single-stranded DNA, the sequence of an Okazaki fragment has been obtained which
overlaps with this region. The sequence is as follows:

5’ – gccgcaccctgggcgaatggctacgtcaacgatcactcggctcaggtacat – 3’

a) Assuming this region is known to comprise an exon open reading frame of a gene
located on the top strand, determine the most likely amino acid translation. Use
the standard genetic code table, show your working and explain the logic you have
applied to arrive at your final answer.

[2.5 marks]
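As an illustration of the kind of working part a) asks for, the following Python sketch translates the fragment and its reverse complement in all three reading frames using the standard genetic code. The function names are illustrative only. Which strand counts as the top strand depends on the diagram, which is not reproduced here, so the sketch lists candidate frames rather than selecting a final answer.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Okazaki fragment sequence as given in the question (5' to 3').
FRAGMENT = "gccgcaccctgggcgaatggctacgtcaacgatcactcggctcaggtacat"


def reverse_complement(dna):
    """Return the reverse complement of a DNA string, in upper case."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset=0):
    """Translate complete codons of dna, starting at the given offset (0, 1 or 2)."""
    dna = dna.upper()[offset:]
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate both strands in all three frames; '*' marks a stop codon."""
    frames = {}
    strands = (("as given", dna), ("reverse complement", reverse_complement(dna)))
    for label, strand in strands:
        for offset in range(3):
            frames[(label, offset)] = translate(strand, offset)
    return frames


for (label, offset), protein in six_frames(FRAGMENT).items():
    note = "open" if "*" not in protein else "contains stop"
    print(f"{label:18s} frame {offset + 1}: {protein} ({note})")
```

A frame with no stop codons that also begins with methionine is a natural candidate, but the choice still has to be argued from the strand orientation shown in the diagram.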
b) The genome of a newly discovered organism Mysteriosus thingi has just been
sequenced and the resulting sequence data has been analysed. Use the following
data to calculate the estimated fraction of the M. thingi genome that is coding
(show your working). Based on your result, state whether M. thingi is more likely
to be a eukaryote or a prokaryote, and justify your choice.
Number of genes in the genome = 6120
Average molecular weight of proteins in Mt’s proteome = 22,000 a.m.u. (atomic
mass units; 1 a.m.u. = 1/12 of the mass of one carbon-12 atom)
C-value = 4.5 x 10^6

[2.5 marks]
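One way to set out the arithmetic for part b) is sketched here in Python. Two values are assumptions that do not appear in the question: an average residue mass of about 110 a.m.u. per amino acid, and a C-value expressed in base pairs. The function name and constants are illustrative.

```python
GENES = 6120                    # number of genes (from the question)
MEAN_PROTEIN_MASS = 22_000      # a.m.u. (from the question)
C_VALUE_BP = 4.5e6              # assumed to be in base pairs
MEAN_RESIDUE_MASS = 110         # a.m.u. per amino acid; a common rule of thumb (assumption)


def coding_fraction(genes, protein_mass, genome_bp, residue_mass=MEAN_RESIDUE_MASS):
    """Estimate the fraction of the genome that encodes protein."""
    residues_per_protein = protein_mass / residue_mass   # average protein length
    coding_bp_per_gene = residues_per_protein * 3        # three bases per codon
    return genes * coding_bp_per_gene / genome_bp


fraction = coding_fraction(GENES, MEAN_PROTEIN_MASS, C_VALUE_BP)
print(f"Estimated coding fraction: {fraction:.3f}")
```

The estimate ignores stop codons and non-protein-coding genes, so it should be read as an approximation.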

[Total for Question 2: 5 marks]

[Figure: double-stranded DNA region near an origin of replication. Labels: Top strand, Bottom strand, 5’ and 3’ ends, Labelled Region, Origin.]

3. a) For a NW global alignment of four sequences (w, x, y, z) of lengths Lw, Lx, Ly, Lz,
the match score can be represented as S(wi, xj, yk, zl) in the case of no gaps, or S(wi,
xj, -, zl) in the case of a gap being placed in the third sequence, and so on. Using this
notation, write out the recurrence formula and commented pseudocode for the four
sequence NW global alignment algorithm, where M is the dynamic programming
matrix, and the cost of inserting a single gap in one sequence is given by positive
value d. Assume in your answer that the gap penalty will be applied for every
individual gap position inserted i.e. the maximum total gap penalty that can be
accumulated at any position in the dynamic programming matrix will be 3d.

[10 marks]
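The structure of the four-sequence recurrence can be sketched in Python as follows. Each cell M(i, j, k, l) takes the maximum over the 15 possible moves in which at least one sequence advances; the aligned column is scored with S, and d is subtracted once for every gap in that column (at most 3d). The column score used here is a hypothetical sum-of-pairs match/mismatch function, since the question does not define S, and the sketch returns only the optimal score without a traceback.

```python
from itertools import product


def column_score(column, match=1, mismatch=-1):
    """Hypothetical S(.): sum-of-pairs score over the non-gap residues of a column."""
    residues = [c for c in column if c != "-"]
    score = 0
    for a in range(len(residues)):
        for b in range(a + 1, len(residues)):
            score += match if residues[a] == residues[b] else mismatch
    return score


def nw4_score(w, x, y, z, d=2):
    """Optimal global alignment score of four sequences with per-gap penalty d."""
    seqs = (w, x, y, z)
    lengths = tuple(len(s) for s in seqs)
    # A move is a 0/1 flag per sequence: 1 = consume a residue, 0 = place a gap.
    moves = [m for m in product((0, 1), repeat=4) if any(m)]
    M = {(0, 0, 0, 0): 0}
    # Lexicographic order guarantees every predecessor cell is already filled.
    for cell in product(*(range(n + 1) for n in lengths)):
        if cell == (0, 0, 0, 0):
            continue
        best = None
        for move in moves:
            prev = tuple(c - m for c, m in zip(cell, move))
            if min(prev) < 0:
                continue  # this move would step outside the matrix
            column = tuple(
                s[c - 1] if m else "-" for s, c, m in zip(seqs, cell, move)
            )
            gaps = move.count(0)  # between 0 and 3 gaps in this column
            candidate = M[prev] + column_score(column) - d * gaps
            if best is None or candidate > best:
                best = candidate
        M[cell] = best
    return M[lengths]


print(nw4_score("AC", "AC", "AC", "AC"))
```

Boundary cells need no special case in this formulation, because moves that would leave the matrix are skipped and the remaining moves accumulate the gap penalties automatically. Time and memory both grow with the product Lw x Lx x Ly x Lz.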

b) An alignment of four viral protein sequences is shown below:

MIELSLIDFYLC
MNELTLIDFYLC
MLHLTLLDFYLL
MIHLTLFDFYLC

Calculate a regularized sequence profile, formatted as 20 rows of 12 columns, for the
above small sequence family using the Laplace rule (pseudocount=1) as needed. Give
the resulting relative frequencies to 3 d.p. and order the rows in 3-letter amino acid
code order (Ala, Arg, Asn … Val). Show your working for the first column.

[5 marks]
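A minimal Python sketch of the Laplace-regularized profile calculation is given here. With N sequences and a pseudocount of 1 for each of the 20 amino acids, each entry is (count + 1) / (N + 20). The function and variable names are illustrative.

```python
ALIGNMENT = ["MIELSLIDFYLC", "MNELTLIDFYLC", "MLHLTLLDFYLL", "MIHLTLFDFYLC"]

# One-letter codes in 3-letter-code alphabetical order (Ala, Arg, Asn ... Val).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def laplace_profile(alignment, pseudocount=1):
    """Return {amino acid: [relative frequency per column]} with add-one smoothing."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    denominator = n_seqs + pseudocount * len(AMINO_ACIDS)
    profile = {}
    for aa in AMINO_ACIDS:
        profile[aa] = [
            (sum(seq[col] == aa for seq in alignment) + pseudocount) / denominator
            for col in range(n_cols)
        ]
    return profile


profile = laplace_profile(ALIGNMENT)
for aa in AMINO_ACIDS:
    print(aa, " ".join(f"{p:.3f}" for p in profile[aa]))
```

For the first column, Met is observed four times, giving (4 + 1) / (4 + 20) = 5/24, and every other amino acid gets (0 + 1) / 24.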
c) Again, using the same alignment, compute appropriate sequence weights for each
sequence using average sequence dissimilarity weighting (using simple amino acid
percentage identity as the similarity metric). Again, show your working.

[5 marks]
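The weighting scheme in part c) can be sketched in Python as follows. Each sequence's raw weight is its average dissimilarity (1 minus fractional identity) to the other sequences. Normalising the weights so that they sum to 1 is an assumption made here; other conventions, such as scaling to the number of sequences, are also in use.

```python
ALIGNMENT = ["MIELSLIDFYLC", "MNELTLIDFYLC", "MLHLTLLDFYLL", "MIHLTLFDFYLC"]


def identity(a, b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dissimilarity_weights(alignment):
    """Return (raw average dissimilarities, weights normalised to sum to 1)."""
    n = len(alignment)
    raw = []
    for i, seq in enumerate(alignment):
        others = [1 - identity(seq, alignment[j]) for j in range(n) if j != i]
        raw.append(sum(others) / len(others))
    total = sum(raw)
    return raw, [r / total for r in raw]


raw, weights = dissimilarity_weights(ALIGNMENT)
for seq, r, w in zip(ALIGNMENT, raw, weights):
    print(f"{seq}  average dissimilarity = {r:.3f}  weight = {w:.3f}")
```

The sequence that differs most from the rest of the family receives the largest weight, which is the intended effect of this scheme.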

d) Using the sequence weights you have calculated, recompute your sequence profile
to take sequence diversity into account i.e. by appropriately weighting the raw counts.
Again, give the weighted frequencies to 3 d.p. and highlight any values which differ
from the original profile.

[2.5 marks]
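A possible way to fold sequence weights into the profile is sketched here in Python. It assumes the weights are rescaled to sum to the number of sequences, so that the pseudocount keeps the same relative strength as in the unweighted case; this convention is an assumption, not something the question prescribes. The example weights are those obtained from average-dissimilarity weighting of this alignment when normalised to sum to 1.

```python
ALIGNMENT = ["MIELSLIDFYLC", "MNELTLIDFYLC", "MLHLTLLDFYLL", "MIHLTLFDFYLC"]
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # Ala, Arg, Asn ... Val


def weighted_laplace_profile(alignment, weights, pseudocount=1):
    """Profile from weighted counts; weights are rescaled to sum to len(alignment)."""
    n_seqs = len(alignment)
    scale = n_seqs / sum(weights)
    scaled = [w * scale for w in weights]
    denominator = n_seqs + pseudocount * len(AMINO_ACIDS)
    profile = {aa: [] for aa in AMINO_ACIDS}
    for col in range(len(alignment[0])):
        for aa in AMINO_ACIDS:
            count = sum(w for seq, w in zip(alignment, scaled) if seq[col] == aa)
            profile[aa].append((count + pseudocount) / denominator)
    return profile


# Example weights (normalised to sum to 1; see the assumption stated above).
example_weights = [0.25, 0.225, 0.30, 0.225]
weighted = weighted_laplace_profile(ALIGNMENT, example_weights)
uniform = weighted_laplace_profile(ALIGNMENT, [1, 1, 1, 1])

for aa in AMINO_ACIDS:
    row = " ".join(
        f"{p:.3f}{'*' if abs(p - q) > 5e-4 else ' '}"
        for p, q in zip(weighted[aa], uniform[aa])
    )
    print(aa, row)
```

Entries marked with an asterisk differ from the unweighted profile; fully conserved columns are unchanged because the rescaled weights still sum to the number of sequences.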

e) Finally, consider a new potential family member, shown below as the final sequence
beneath the alignment. This match was found by searching a large data bank of sequences (UniProt).

MIELSLIDFYLC
MNELTLIDFYLC
MLHLTLLDFYLL
MIHLTLFDFYLC
| | |
MRKLNIVEYFVS

Average % Amino Acid Composition in UniProt Data Bank:
Ala: 8.25
Arg: 5.53
Asn: 4.06
Asp: 5.45
Cys: 1.37
Gln: 3.93
Glu: 6.75
Gly: 7.07
His: 2.27
Ile: 5.96
Leu: 9.66
Lys: 5.84
Met: 2.42
Phe: 3.86
Pro: 4.70
Ser: 6.56
Thr: 5.34
Trp: 1.08
Tyr: 2.92
Val: 6.87

Use your previously calculated weighted and regularized sequence profile, and the
above average amino acid composition table (background frequencies), to compute
the total LLR (Log Likelihood Ratio) score for the new sequence matched against
your profile. The individual match scores and total score should be computed in bits
(log base2). From your final answer, and the information above, give your judgement
as to whether this new sequence can confidently be assigned to this family.

[2.5 marks]
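The scoring step in part e) can be illustrated with the following Python sketch. Each position contributes log2(p / q), where p is the profile probability of the observed residue and q is its background frequency. For brevity the sketch uses the unweighted Laplace profile; substituting the weighted profile changes only the values of p. Names are illustrative.

```python
import math

ALIGNMENT = ["MIELSLIDFYLC", "MNELTLIDFYLC", "MLHLTLLDFYLL", "MIHLTLFDFYLC"]
NEW_SEQUENCE = "MRKLNIVEYFVS"

# Average % amino acid composition in UniProt, as tabulated in the question.
BACKGROUND_PERCENT = {
    "A": 8.25, "R": 5.53, "N": 4.06, "D": 5.45, "C": 1.37,
    "Q": 3.93, "E": 6.75, "G": 7.07, "H": 2.27, "I": 5.96,
    "L": 9.66, "K": 5.84, "M": 2.42, "F": 3.86, "P": 4.70,
    "S": 6.56, "T": 5.34, "W": 1.08, "Y": 2.92, "V": 6.87,
}


def laplace_column(alignment, col, pseudocount=1):
    """Regularized residue probabilities for one alignment column."""
    denominator = len(alignment) + pseudocount * len(BACKGROUND_PERCENT)
    return {
        aa: (sum(seq[col] == aa for seq in alignment) + pseudocount) / denominator
        for aa in BACKGROUND_PERCENT
    }


def llr_scores(alignment, query):
    """Per-position log-likelihood ratio scores in bits."""
    scores = []
    for col, residue in enumerate(query):
        p = laplace_column(alignment, col)[residue]
        q = BACKGROUND_PERCENT[residue] / 100
        scores.append(math.log2(p / q))
    return scores


scores = llr_scores(ALIGNMENT, NEW_SEQUENCE)
for residue, score in zip(NEW_SEQUENCE, scores):
    print(f"{residue}: {score:+.3f} bits")
print(f"Total LLR: {sum(scores):+.3f} bits")
```

A positive total means the sequence is more likely under the profile than under the background model; whether that is enough for a confident assignment also depends on the size of the data bank searched.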

[Total for Question 3: 25 marks]

4. Results for a prototype experiment investigating the genes involved in skin cancer are
given below. Assume this data has been normalized and that it shows log2 expression
units as measured by an Affymetrix GeneChip machine. Both diseased samples and
normal samples have 3 biological replicates. Only the first 4 genes are given to
facilitate calculations.

          Log Expression for            Log Expression for
          Melanoma Samples              Normal Skin Samples
Genes     Array 1   Array 2   Array 3   Array 4   Array 5   Array 6
Gene 1    8.8       8.6       8.8       7.4       7.8       7.6
Gene 2    6.3       6.2       6.3       6.2       6.3       6.3
Gene 3    8.3       8.2       8.3       8.6       8.8       8.8
Gene 4    8.4       8.4       8.5       9.5       9.5       9.5
...       ...       ...       ...       ...       ...       ...
...       ...       ...       ...       ...       ...       ...

Do the following tasks using whatever tools you wish (Python with matplotlib,
R, Excel, etc., or just a calculator doing manual calculations). Support your
results with the code that you wrote if you are using a language/package or
write down individual steps you carried out if you did it manually.
(a) Draw an MvA plot for the above data clearly labelling the genes (use all three
biological replicates to determine the average expression level for a
particular gene in a particular condition).
[4 marks]
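The coordinates for an MvA plot can be computed as in the following Python sketch. Because the data are already in log2 units, M is taken as the difference of the condition means and A as their average. Taking M as melanoma minus normal is an assumption about the sign convention.

```python
from statistics import mean

# (melanoma replicates, normal replicates) for each gene, from the table.
EXPRESSION = {
    "Gene 1": ([8.8, 8.6, 8.8], [7.4, 7.8, 7.6]),
    "Gene 2": ([6.3, 6.2, 6.3], [6.2, 6.3, 6.3]),
    "Gene 3": ([8.3, 8.2, 8.3], [8.6, 8.8, 8.8]),
    "Gene 4": ([8.4, 8.4, 8.5], [9.5, 9.5, 9.5]),
}


def m_and_a(melanoma, normal):
    """Return (M, A): log2 fold change and average log2 expression."""
    mel, nor = mean(melanoma), mean(normal)
    return mel - nor, (mel + nor) / 2


mva_points = {gene: m_and_a(mel, nor) for gene, (mel, nor) in EXPRESSION.items()}
for gene, (m, a) in mva_points.items():
    print(f"{gene}: M = {m:+.3f}, A = {a:.3f}")
```

The resulting points can be drawn with any plotting tool, with A on the horizontal axis, M on the vertical axis, and each point labelled with its gene name.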
(b) Use Welch’s t-test to determine the p-value of whether there is a significant
difference in gene expression of each gene going from normal samples to
diseased samples.
[3 marks]
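Welch's t-test can be worked through with the standard library alone, as in the following Python sketch. It computes the t statistic and the Welch-Satterthwaite degrees of freedom, then obtains a two-sided p-value by numerically integrating the Student t density with Simpson's rule. The numerical integration is a stand-in chosen to keep the example dependency-free; a statistics package would normally be used for the p-value. All names are illustrative.

```python
import math
from statistics import mean, variance

SAMPLES = {
    "Gene 1": ([8.8, 8.6, 8.8], [7.4, 7.8, 7.6]),
    "Gene 2": ([6.3, 6.2, 6.3], [6.2, 6.3, 6.3]),
    "Gene 3": ([8.3, 8.2, 8.3], [8.6, 8.8, 8.8]),
    "Gene 4": ([8.4, 8.4, 8.5], [9.5, 9.5, 9.5]),
}


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test.

    Raises ZeroDivisionError if both samples have zero variance.
    """
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return max(0.0, 1 - 2 * total * h / 3)


results = {}
for gene, (melanoma, normal) in SAMPLES.items():
    t, df = welch_t(melanoma, normal)
    results[gene] = (t, df, two_sided_p(t, df))
    print(f"{gene}: t = {t:.3f}, df = {df:.2f}, p = {results[gene][2]:.4g}")
```

Note that Gene 4 has zero variance in the normal samples, so its degrees of freedom come entirely from the melanoma replicates.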
(c) Assume a large number of genes are measured in this experiment. Describe
two reasons why many false predictions of differentially expressed genes
would result when applying the t-test formula in this particular
experimental design when using a p-value threshold of 5%. For each
reasons, describe how the t-test calculation can be modified to decrease the
number of false positives. [Maximum word limit of 200 words.]
[4 marks]

(d) Draw a volcano plot for the above experiment clearly showing the different
genes.
[4 marks]
(e) Carry out hierarchical clustering of the samples using the L1 (Manhattan, or
city-block) distance metric and complete linkage. Provide the similarity matrix
determined, the individual steps that you took during the clustering process
(similar to the lecture), and the final dendrogram with samples clearly
labelled and branch lengths annotated.
[5 marks]
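The clustering procedure in part (e) can be sketched in plain Python as follows. Each array is treated as a vector over the four genes, pairwise L1 distances are computed, and at every step the two clusters with the smallest complete-linkage distance (the largest pairwise distance between their members) are merged. The function names are illustrative, and the sketch records merge heights rather than drawing the dendrogram.

```python
# Expression vectors per array (Gene 1 to Gene 4), read column-wise from the table.
ARRAYS = {
    "Array 1": [8.8, 6.3, 8.3, 8.4],
    "Array 2": [8.6, 6.2, 8.2, 8.4],
    "Array 3": [8.8, 6.3, 8.3, 8.5],
    "Array 4": [7.4, 6.2, 8.6, 9.5],
    "Array 5": [7.8, 6.3, 8.8, 9.5],
    "Array 6": [7.6, 6.3, 8.8, 9.5],
}


def manhattan(u, v):
    """L1 (city-block) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def complete_linkage(samples):
    """Agglomerative clustering; returns a list of (cluster_a, cluster_b, height)."""
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the farthest pair of members.
                dist = max(
                    manhattan(samples[a], samples[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = complete_linkage(ARRAYS)
for left, right, height in merges:
    print(f"merge {left} + {right} at distance {height:.1f}")
```

The merge heights give the branch lengths to annotate on the dendrogram. Recomputing cluster distances from the original vectors at every step is inefficient but keeps the logic close to a manual calculation.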
(f) After this prototype experiment, the research lab wants to do a more
definitive experiment analysing the genes involved in skin cancer. Write a
report describing a more extensive experiment that would enable them to
discover different unknown subtypes of skin cancer and also detect genes
that are differentially expressed in each subtype compared to normal.
Outline important aspects of the overall experimental methodology
including issues that may need to be considered and possible solutions. Also
explain how the resulting data would be analysed to satisfy the study
objectives, and how the methodology would differ from the calculations carried out
above. [Maximum word limit of 500 words.]
[10 marks]

[Total for Question 4: 30 marks]