Fundamentals of Computer Vision, Spring 2020
Project Assignment 2
Forward (3D point to 2D point) and Inverse (2D point to 3D ray) Camera Projection
Due Date: Sunday, April 19, 2020, 11:59pm EST
  
1 Motivation
The goal of this project is to implement forward (3D point to 2D point) and inverse (2D point to 3D ray) camera projection, and to perform triangulation from two cameras to do 3D reconstruction from pairs of matching 2D image points. This project will involve understanding the relationships between 2D image coordinates and 3D world coordinates, and the chain of transformations that make up the pinhole camera model that was discussed in class. Your specific tasks will be to project 3D coordinates (sets of 3D joint locations on a human body, measured by motion capture equipment) into image pixel coordinates that you can overlay on top of an image, to then convert those 2D points back into 3D viewing rays, and then to triangulate the viewing rays of the two camera views to recover the original 3D coordinates you started with (or values close to those coordinates).
You will be provided:
• 3D point data for each of 12 body joints for a set of motion capture frames recorded of a subject performing a Taiji exercise. The 12 joints represent the shoulders, elbows, wrists, hips, knees, and ankles. Each joint will be provided for a time series that is ~30,000 frames long, representing a 5-minute performance recorded at 100 frames per second in a 3D motion capture lab.
• Camera calibration parameters (intrinsic and extrinsic) for two video cameras that were also recording the performance. Each set of camera parameters contains all the information needed to project 3D joint data into pixel coordinates in one of the two camera views.
• An mp4 movie file containing the video frames recorded by each of the two video cameras. The video was recorded at 50 frames per second.
While this project appears to be a simple task at first, you will discover that practical applications have hurdles to overcome. Specifically, there are 12 joints in each of ~30,000 frames of data to be projected into 2 separate camera coordinate systems. That is over 700,000 joint projections into camera views and roughly 350,000 reconstructions back into world coordinates! Furthermore, you will need a very clear understanding of the pinhole camera model that we covered in class to be able to write functions that correctly project from 3D to 2D and back again.
The specific project outcomes include:
• Experience in Matlab programming
• Understanding intrinsic and extrinsic camera parameters
• Projection of 3D data into 2D image coordinates
• Reconstruction of 3D locations by triangulation from two camera views
• Measurement of 3D reconstruction error
• Practical understanding of epipolar geometry
  
2 The Basic Operations
The following steps will be essential to the successful completion of the project:
1. Input and parsing of the mocap dataset. Read in and properly interpret the 3D joint data.
2. Input and parsing of camera parameters. Read in each set of camera parameters and interpret them with respect to our mathematical camera projection model.
3. Use the camera parameters to project 3D joints into pixel locations in each of the two image coordinate systems.
4. Reconstruct the 3D location of each joint in the world coordinate system from the projected 2D joints you produced in Step 3, using two-camera triangulation.
5. Compute the Euclidean (L²) distance between all joint pairs. This is a per-joint, per-frame L² distance between the original 3D joints and the reconstructed 3D joints, providing a quantitative analysis of the reconstruction error (a short sketch is given below this list).
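To illustrate step 5, here is a minimal sketch of the per-joint, per-frame distance computation. The variable names origJoints and reconJoints are hypothetical: both are assumed to be Fx12x3 arrays of x, y, z world coordinates covering the same set of valid frames.
diffs = origJoints - reconJoints;        %Fx12x3 coordinate differences (hypothetical inputs)
dists = sqrt(sum(diffs.^2, 3));          %Fx12 per-joint, per-frame L2 distances
meanPerJoint = mean(dists, 1);           %1x12 mean reconstruction error for each joint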
2.1 Reading the 3D joint data
The motion capture data is in the file Subject4-Session3-Take4_mocapJoints.mat. Once you load it in, you have a 21614x12x4 array of numbers. The first dimension is frame number, the second is joint number, and the last is joint coordinates + confidence score for each joint. Specifically, the following snippet of code will extract the x, y, z locations for the joints in a specific mocap frame.
mocapFnum = 1000;  %mocap frame number 1000  
x = mocapJoints(mocapFnum,:,1);     %array of 12 X coordinates  
y = mocapJoints(mocapFnum,:,2);     %  Y coordinates  
z = mocapJoints(mocapFnum,:,3);     %  Z coordinates  
conf = mocapJoints(mocapFnum,:,4);  %confidence values  
Each joint has a binary “confidence” associated with it. Joints that are not defined in a frame have a confidence of 0. Feel free to ignore any frames that don’t have all confidences = 1.
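For example, one possible way to find the usable frames is the following minimal sketch (it assumes mocapJoints has been loaded as above; the variable name validFrames is hypothetical):
conf = mocapJoints(:,:,4);               %Fx12 array of confidence values
validFrames = find(all(conf == 1, 2));   %indices of frames where every joint is defined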
There are 12 joints, in this order:
1 Right shoulder  
2 Right elbow  
3 Right wrist  
4 Left shoulder  
5 Left elbow  
6 Left wrist  
7 Right hip  
8 Right knee  
9 Right ankle  
10 Left hip  
11 Left knee  
12 Left ankle  
2.2 Reading camera parameters
There are two cameras, called “vue2” and “vue4”, and two files specifying their calibration parameters: vue2CalibInfo.mat and vue4Calibinfo.mat. Each of these contains a structure with intrinsic, extrinsic, and nonlinear distortion parameters for each camera. Here are the values of the fields after reading in one of the structures:
vue2 =    
struct with fields:  
foclen: 1557.8  
orientation: [-0.27777 0.7085 -0.61454 -0.20789]  
position: [-4450.1 5557.9 1949.1]  
prinpoint: [976.04 562.82]  
radial: [1.4936e-07 4.3841e-14]  
aspectratio: 1  
skew: 0  
Pmat: [3×4 double]  
Rmat: [3×3 double]  
Kmat: [3×3 double]  
  
Part of your job will be figuring out what those fields mean with regard to the pinhole camera model parameters we discussed in class lectures. Which are the internal parameters? Which are the external parameters? Which internal parameters combine to form the matrix Kmat? Which external parameters combine to form the matrix Pmat? Hint: the field “orientation” is a unit quaternion describing the camera orientation, which is also represented by the 3x3 matrix Rmat. What is the location of the camera? Verify that the location of the camera and the rotation Rmat of the camera combine in the expected way (expected as per one of the slides in our class lectures on camera parameters) to yield the appropriate entries in Pmat.
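For example, one way to check your interpretation is the minimal sketch below. It assumes that Pmat has the form Kmat*[Rmat | -Rmat*C], where C is the camera position in world coordinates; verify that form against the lecture slides rather than taking it as given.
load('vue2CalibInfo.mat');                 %loads the struct vue2
R = vue2.Rmat;                             %3x3 rotation from world to camera coordinates
C = vue2.position(:);                      %3x1 camera center in world coordinates
K = vue2.Kmat;                             %3x3 intrinsic matrix
Pcheck = K * [R, -R*C];                    %expected 3x4 projection matrix (assumed form)
disp(max(abs(Pcheck(:) - vue2.Pmat(:))));  %should be near zero if the assumed form is right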
2.3 Projecting 3D points into 2D pixel locations
Ignoring the nonlinear distortion parameters in the “radial” field for now, write a function from scratch that takes either a single 3D point or an array of 3D points and projects it (or them) into 2D pixel coordinates. You will want to refer to our lecture notes for the transformation chain that maps 3D world coordinates into 2D pixel coordinates.
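As a starting point, here is a minimal sketch of such a function. The function name is hypothetical, and it assumes that Pmat maps homogeneous world points directly to homogeneous pixel coordinates; confirm that assumption against your answer to Section 2.2 before relying on it.
function pts2D = project3Dto2D(pts3D, cam)
%PROJECT3DTO2D  Project 3xN world points into 2xN pixel coordinates (sketch).
%   pts3D : 3xN array of 3D world points
%   cam   : calibration struct (e.g. vue2) containing the field Pmat
N = size(pts3D, 2);
homog = cam.Pmat * [pts3D; ones(1, N)];     %3xN homogeneous pixel coordinates
pts2D = [homog(1,:) ./ homog(3,:); ...
         homog(2,:) ./ homog(3,:)];         %divide out the homogeneous scale
end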
For verification, it will be helpful to visualize your projected 2D joints by overlaying them as points on the 2D video frame corresponding to the motion capture frame. Two video files are given to you: Subject4-Session3-24form-Full-Take4-Vue2.mp4 is the video from camera vue2, and Subject4-Session3-24form-Full-Take4-Vue4.mp4 is the video from camera vue4. To get a video frame out of the mp4 file we can use VideoReader in Matlab. It is nonintuitive to use, so
to help out, here is a snippet of code that can read the video frame from vue2 corresponding to the motion capture frame number mocapFnum.
%initialization of VideoReader for the vue video.    
%YOU ONLY NEED TO DO THIS ONCE AT THE BEGINNING  
filenamevue2mp4 = 'Subject4-Session3-24form-Full-Take4-Vue2.mp4';  
vue2video = VideoReader(filenamevue2mp4);  
%now we can read in the video for any mocap frame mocapFnum.  
%the (50/100) factor is here to account for the difference in frame  
%rates between video (50 fps) and mocap (100 fps).  
vue2video.CurrentTime = (mocapFnum-1)*(50/100)/vue2video.FrameRate;  
vid2Frame = readFrame(vue2video);  
The result is a 1088x1920x3 unsigned 8-bit integer color image that can be displayed by image(vid2Frame).
If all went well with your projection of 3D to 2D, you should be able to plot the x and y coordinates of your 2D points onto the image, and they should appear to be in roughly the correct places. IMPORTANT NOTE: since we ignore nonlinear distortion for now, it might be the case that your projected points look shifted off from the correct image locations. That is OK. However, if the body points are grossly incorrect (body is much larger or smaller, or forming a really weird shape that doesn’t look like the arms and legs of the person in the image), then something is likely wrong in your projection code.
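For example, a minimal overlay sketch (assuming xy2 is a 2x12 array of projected pixel coordinates for the chosen frame, produced by your own projection function; the variable name is hypothetical):
image(vid2Frame); axis image; hold on;              %show the video frame
plot(xy2(1,:), xy2(2,:), 'g.', 'MarkerSize', 15);   %overlay the 12 projected joints
hold off;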
2.4 Triangulation back into a set of 3D scene points
As a result of the above step, for a given mocap frame you now have two sets of corresponding 2D pixel locations, in the two camera views. Perform triangulation on each of the 12 pairs of 2D points to estimate a recovered 3D point position. As per our class lecture on triangulation, this will be done, for a corresponding pair of 2D points, by converting each into a viewing ray represented by a camera center and a unit vector pointing along the ray passing through the 2D point in the image and out into the 3D scene. You will then compute the 3D point location that is closest to both rays (because they might not exactly intersect). Go back and refer to our lecture on triangulation to see how to do the computation.
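One possible formulation is the midpoint method sketched below (a minimal sketch with hypothetical function and variable names; it recovers the midpoint of the shortest segment between the two rays, and is not necessarily identical to the method presented in lecture).
function X = triangulateMidpoint(c1, u1, c2, u2)
%TRIANGULATEMIDPOINT  Midpoint of closest approach between two rays (sketch).
%   c1, c2 : 3x1 camera centers in world coordinates
%   u1, u2 : 3x1 unit direction vectors of the two viewing rays
A  = [u1, -u2];          %3x2 system for the two ray parameters
ab = A \ (c2 - c1);      %least-squares solution for the ray parameters [a; b]
p1 = c1 + ab(1)*u1;      %closest point on ray 1
p2 = c2 + ab(2)*u2;      %closest point on ray 2
X  = (p1 + p2) / 2;      %recovered 3D point
end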