Synthesis AI’s Face API: Outputs

Overview

The Face API is a data-generating service that can create millions of labeled images of faces with a specifiable variety of genders, ethnicities, ages, expressions, eye gazes, hair, hats, glasses, and more.

A request (also known as a “job”) is typically submitted by using a command line client application provided by Synthesis AI, and a job ID is returned. This job ID is referenced when the callback is returned, notifying the user that the job and all of its tasks have completed.

Some key terminology: a job has many tasks (aka scenes) associated with it. Each task has many assets, which include both text and binary metadata, like an EXR, metadata.jsonl, and *.info.json, and optionally several extracted, standalone binary images from the EXR channels like rgb, segments, depth, normals, and alpha images (example). More information on how to download all of these files via API exists at Asset Download API.

For each task, the only two unique asset types are EXR (a multi-channel OpenEXR file) and *.info.json (computed values for each task not included at image request time). The other PNG & TIF files are extracted from layers in the EXR.

The metadata.jsonl file is also unique, but typically contains every task in a single file (unless requested differently).

We provide a face_api_dataset library in Python for parsing all of the output files.

Finally, there is an example of complete output of a single scene/task in the documentation appendix.


Metadata JSON Lines File

The metadata captures the settings for each image as specified at request time, such as camera pose, lighting setup, and face attributes. These are simply what was requested at job creation time, so this document does not go into additional detail.

Note that this is a JSON Lines file, and that each line contains the full JSON representation of a single scene. In the example below, "render_id":188 means that this line corresponds to 188.exr/188.info.json.

The value for version in each line of metadata.jsonl is incremented if there are structural changes to the format of the file, so you can write code that targets a specific version and issue warnings or errors on an unexpected version. A minimal parsing sketch appears after the example below.

...
{
  "version": "1.1",
  "task_id": "da755eb9-a9a2-416e-bcf3-b6421622290a",
  "render_id": 188,
  "render": {
    "resolution": {
      "w": 512,
      "h": 512
    },
    "noise_threshold": 0.1,
    "samples": null,
    "engine": "vray",
    "hair": null,
    "denoise": true
  },
  "scene": {
    "id": 3457,
    "body": {
      "type": 1,
      "enabled": true
    },
    "camera": {
      "nir": {
        "enabled": false,
        "intensity": 0,
        "size_meters": 0
      },
      "location": {
        "depth_meters": 1.4412577,
        "vertical_angle": 0,
        "offset_vertical": 0,
        "horizontal_angle": 4.7856603,
        "offset_horizontal": 0,
        "truck_centimeters": 0,
        "pedestal_centimeters": 0
      },
      "specifications": {
        "focal_length": 100,
        "sensor_width": 33,
        "lens_shift_vertical": 0,
        "lens_shift_horizontal": 0,
        "window_offset_vertical": 0,
        "window_offset_horizontal": 0
      }
    },
    "version": 0,
    "clothing": {
      "type": 0,
      "enabled": true,
      "variant": 5
    },
    "accessories": {
      "mask": {
        "style": "ClothHead",
        "variant": 0,
        "position": 3
      },
      "glasses": {
        "style": "none",
        "metalness": 0,
        "transparency": 1,
        "lens_color_rgb": [255, 255, 255]
      },
      "headwear": {
        "style": "none"
      },
      "headphones": {
        "style": "0010_wiredEarbuds"
      }
    },
    "environment": {
      "hdri": {
        "name": "freight_station",
        "rotation": 24,
        "intensity": 1.0301735
      },
      "lights": [],
      "geometry": "none",
      "location": {
        "id": 0,
        "enabled": false,
        "facing_direction": 0,
        "head_position_seed": 0,
        "indoor_light_enabled": false
      }
    },
    "facial_attributes": {
      "gaze": {
        "vertical_angle": 13.074141,
        "horizontal_angle": -8.255154
      },
      "hair": {
        "color": "reddish_blonde",
        "style": "female_hair_long_simple_1",
        "color_seed": 0,
        "relative_length": 0.75177073,
        "relative_density": 1
      },
      "head_turn": {
        "yaw": 14.693872,
        "roll": -17.496176,
        "pitch": 9.588731
      },
      "expression": {
        "name": "none",
        "intensity": 0.5190663
      }
    },
    "identity_metadata": {
      "id": 3457,
      "age": 47,
      "gender": "female",
      "ethnicity": "white",
      "height_cm": 168,
      "weight_kg": 60
    }
  }
}
...
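
Below is a minimal sketch (not part of the API itself) for iterating over metadata.jsonl; the field names follow the example above, and the set of supported versions is an assumption you would adjust for your own code:

import json

SUPPORTED_VERSIONS = {"1.1"}  # versions this parsing code was written against

with open("metadata.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if record["version"] not in SUPPORTED_VERSIONS:
            print(f"warning: unexpected metadata version {record['version']}")
        # render_id ties this line to the matching assets, e.g. 188.exr / 188.info.json
        print(record["render_id"], record["scene"]["identity_metadata"]["gender"])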

*.info.json File

The *.info.json files contain computed values for each task that are not fully known or included at job request time. See also: example *.info.json in appendix. We also provide a face_api_dataset library in Python for parsing this file.


Contents

Each info json file contains *computed* information about the scene it’s numbered with (e.g. 11.info.json is for 11.exr/11.rgb.png):

  • Pupil Coordinates
  • Facial Landmarks (iBug 68-like)
  • Camera Settings
  • Eye Gaze
  • Segmentation Values

Coordinate System

It’s important to note that several different axes are referenced. We use a right-hand coordinate system. The world and camera axes are aligned Y-Up, X-Right, with negative Z pointing forward. At zero rotation, the head’s axes align with the world axes, with the head looking along the +ve Z-axis, as shown below.

Coordinate System: The camera is located along the +ve Z-axis and the Head is located along the +ve Y-axis (with no rotation)

Pupil Coordinates

3D pupil coordinates are available in the camera view axes, world axes, and 2D screen pixel coordinates.

  • screen_space_pos:
    The u,v (x,y) pixel coordinates of the pupils, expressed as a fraction of the width and height of the full image. This value is derived from the scene generation application by using a camera-UV projection method to map point positions into a “zero to one” UV space that matches screen space.

  • camera_space_pos:
    A point’s 3D coordinates in the camera’s reference frame. This value is calculated by multiplying the camera extrinsic matrix (world2cam 4x4 transformation matrix) with the point’s world space coordinates.

  • world_space_pos:
    The 3D coordinates of a point in the world reference frame. These values are provided directly by the scene generation application.

An example is:

"pupil_coordinates": {
    "pupil_left": {
        "screen_space_pos": [
            0.5784899592399597,
            0.5743290781974792
        ],
        "camera_space_pos": [
            0.03564847633242607,
            -0.03375868871808052,
            -1.3762997388839722
        ],
        "distance_to_camera": 1.3771750926971436,
        "world_space_pos": [
            0.030104853212833405,
            0.27054131031036377,
            0.10770561546087265
        ]
    },
    "pupil_right": {
        "screen_space_pos": [
            0.44264423847198486,
            0.525549054145813
        ],
        "camera_space_pos": [
            0.03564847633242607,
            -0.03375868871808052,
            -1.3762997388839722
        ],
        "distance_to_camera": 1.3855191469192505,
        "world_space_pos": [
            0.030104853212833405,
            0.27054131031036377,
            0.10770561546087265
        ]
    }
}
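
As a hedged illustration of the camera_space_pos relationship described above, the sketch below recomputes it from world_space_pos and the transform_world2cam matrix (documented in the Camera section that follows); info is assumed to be a dict parsed from one *.info.json file:

import json
import numpy as np

with open("188.info.json") as f:  # file name follows the metadata example above
    info = json.load(f)

world2cam = np.array(info["camera"]["transform_world2cam"]["mat_4x4"])
world_pos = np.array(info["pupil_coordinates"]["pupil_left"]["world_space_pos"])

# Homogeneous multiply: [x, y, z, 1] in world space -> camera space
cam_pos = (world2cam @ np.append(world_pos, 1.0))[:3]

print(cam_pos)                  # should reproduce pupil_left camera_space_pos
print(np.linalg.norm(cam_pos))  # should reproduce pupil_left distance_to_camera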


Facial Landmarks

Our facial landmark outputs are placed at locations *similar* to, but not exactly the same as, the iBug 68 landmark locations. The primary difference is that occluded landmarks are placed at their actual 3D locations; i.e., occluded landmarks are not “displaced” to match visible face contours.


Sample code for parsing these landmarks in Python is at: https://github.com/Synthesis-AI-Dev/project-landmarks-to-image

An example is:

"landmarks": [
  {
    "screen_space_pos": [
      0.382,
      0.49
    ],
    "ptnum": 0,
    "camera_space_pos": [
      -0.05711684003472328,
      0.004818509332835674,
      -1.467186450958252
    ],
    "distance_to_camera": 1.4683057069778442,
    "world_space_pos": [
      -0.05475452169775963,
      0.30911850929260254,
      0.009396456182003021
    ]
  },
  ...
]
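
For a quick look without the repository above, this sketch converts a landmark’s screen_space_pos (a fraction of image width/height, as with the pupils) into pixel coordinates; the 512x512 image size comes from the metadata example, and info is a dict parsed from a *.info.json file as in the earlier sketch:

def to_pixels(screen_space_pos, width=512, height=512):
    """Convert normalized screen-space coordinates to pixel coordinates."""
    u, v = screen_space_pos
    return u * width, v * height

for landmark in info["landmarks"]:
    x, y = to_pixels(landmark["screen_space_pos"])
    print(landmark["ptnum"], round(x, 1), round(y, 1))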


Camera

Computed camera information is output as follows:

  • intrinsics:
    The camera pinhole matrix, used by many libraries such as OpenCV. It is defined as:
    [[fx, 0, cx],
     [0, fy, cy],
     [0, 0,  1]],
    where fx, fy = focal length in pixels and cx, cy = principal point (center) of the image.

  • field_of_view: Vertical and horizontal field of view, provided in radians.

  • focal_length_mm: Essentially, the magnification/power of the lens.

  • sensor (width_mm & height_mm): The physical dimensions of the camera sensor's active pixel area.

  • transform_world2cam & transform_cam2world: Rigid transformation matrices (rotation + translation) between the world and camera coordinate systems. Each is offered in multiple formats: a 4x4 matrix, a translation with a rotation quaternion, and Euler angles. The redundant formats are provided to reduce floating point rounding errors. The world2cam transform converts a point in world coordinates to camera coordinates.

An example is:

"camera": {
  "intrinsics": [
    [
      1551.5151515151515,
      0.0,
      256.0
    ],
    [
      0,
      1551.5151515151515,
      256.0
    ],
    [
      0,
      0,
      1.0
    ]
  ],
  "field_of_view": {
    "y_axis_rads": 0.32705323764198635,
    "x_axis_rads": 0.32705323764198635
  },
  "focal_length_mm": 100.0,
  "transform_world2cam": {
    "mat_4x4": [
      [
        0.9965137705108044,
        0.0,
        0.0834284434850595,
        -0.0033371377394023813
      ],
      [
        0.0,
        1.0,
        0.0,
        -0.3043
      ],
      [
        -0.0834284434850595,
        0.0,
        0.9965137705108045,
        -1.4811182508204321
      ],
      [
        0.0,
        0.0,
        0.0,
        1.0
      ]
    ],
    "translation_xyz": [
      -0.0033371377394023813,
      -0.3043,
      -1.4811182508204321
    ],
    "quaternion_wxyz": [
      0.0,
      0.041750625679117394,
      0.0,
      0.9991280624901906
    ],
    "euler_xyz": [
      0.0,
      4.785660300000001,
      0.0
    ]
  },
  "sensor": {
    "width_mm": 33.0,
    "height_mm": 33.0
  },
  "transform_cam2world": {
    "mat_4x4": [
      [
        0.9965137705108045,
        0.0,
        -0.08342844348505951,
        -0.12024188657185686
      ],
      [
        0.0,
        1.0,
        0.0,
        0.3043
      ],
      [
        0.08342844348505951,
        0.0,
        0.9965137705108045,
        1.47623314490473
      ],
      [
        0.0,
        0.0,
        0.0,
        1.0
      ]
    ],
    "translation_xyz": [
      -0.12024188657185686,
      0.3043,
      1.47623314490473
    ],
    "quaternion_wxyz": [
      0.0,
      -0.041750625679117394,
      0.0,
      0.9991280624901906
    ],
    "euler_xyz": [
      0.0,
      -4.785660300000001,
      0.0
    ]
  }
}
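
To tie these values together, here is a sketch of a pinhole projection (info is the parsed *.info.json dict from the earlier sketch). Note that fx = focal_length_mm * image_width_px / sensor_width_mm (100 * 512 / 33 ≈ 1551.5 above). The sign conventions (the camera looks along -Z; the image origin is the top-left corner, so camera +Y maps to decreasing pixel rows) are inferred from the coordinate-system notes and reproduce the pupil screen_space_pos example:

def project_to_pixels(cam_pos, intrinsics):
    """Project a camera-space point to pixel coordinates (assumed conventions above)."""
    fx, cx = intrinsics[0][0], intrinsics[0][2]
    fy, cy = intrinsics[1][1], intrinsics[1][2]
    x, y, z = cam_pos
    u = cx + fx * x / -z   # camera looks along -Z, so z is negative in front of it
    v = cy - fy * y / -z   # flip Y: camera Y points up, pixel rows grow downward
    return u, v

K = info["camera"]["intrinsics"]
cam_pos = info["pupil_coordinates"]["pupil_left"]["camera_space_pos"]
print(project_to_pixels(cam_pos, K))  # ~(296.2, 294.1), i.e. screen_space_pos * 512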


Eye Gaze

Eye gaze is computed for convenience as angles from the head’s point of view, as well as a vector in world coordinate space.

  • horizontal_angle & vertical_angle: The direction, in degrees, away from looking straight ahead. An eye looking up and to the left, from the character’s point of view, would have a positive horizontal angle and a positive vertical angle.

  • gaze_vector: The vector from the centroid of the eye volume to the location of the pupil, in 3D world-space coordinates.

An example is:

"gaze_values": {
  "eye_left": {
    "horizontal_angle": -8.255141258239746,
    "gaze_vector": [
      0.377139687538147,
      -0.2047073394060135,
      -0.9032500386238098
    ],
    "vertical_angle": 13.074122428894043
  },
  "eye_right": {
    "horizontal_angle": -8.255141258239746,
    "gaze_vector": [
      0.377139687538147,
      -0.2047073394060135,
      -0.9032500386238098
    ],
    "vertical_angle": 13.074122428894043
  }
}
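
As an illustrative (not canonical) use of these values, the sketch below extends a short gaze ray from the pupil’s world_space_pos along gaze_vector and projects both endpoints into the image, reusing project_to_pixels and the world2cam matrix from the camera sketch above; the 10 cm ray length is arbitrary:

import numpy as np

def world_to_cam(world_pos, world2cam):
    """Apply the 4x4 world2cam transform to a 3D world-space point."""
    return (np.array(world2cam) @ np.append(world_pos, 1.0))[:3]

start_world = np.array(info["pupil_coordinates"]["pupil_left"]["world_space_pos"])
gaze = np.array(info["gaze_values"]["eye_left"]["gaze_vector"])
end_world = start_world + 0.10 * gaze  # extend the gaze ray 10 cm past the pupil

K = info["camera"]["intrinsics"]
w2c = info["camera"]["transform_world2cam"]["mat_4x4"]
start_px = project_to_pixels(world_to_cam(start_world, w2c), K)
end_px = project_to_pixels(world_to_cam(end_world, w2c), K)
print(start_px, end_px)  # endpoints of a gaze line to draw over the RGB image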


Segmentation

The segmentation mapping values indicate which pixel value represents each area of the face.

The segmentation output file (accessed with --segments in the CLI) is a PNG file (example) containing a different pixel value for each area of the face. Each *.info.json (example) file contains the mapping of which pixel value represents each area. Our face_api_dataset library contains Python functions for parsing this file.

Example segmentation mapping section:

"segments_mapping": {
  "background": 0,
  "body": 1,
  "brow": 2,
  "cheek_left": 3,
  "cheek_right": 4,
  "chin": 5,
  "clothing": 6,
  "ear_left": 7,
  "ear_right": 8,
  "eye_left": 9,
  "eye_right": 10,
  "eyelashes": 11,
  "eyelid": 12,
  "eyes": 13,
  "forehead": 14,
  "hair": 15,
  "head": 16,
  "headphones": 17,
  "jaw": 18,
  "jowl": 19,
  "lip_lower": 20,
  "lip_upper": 21,
  "mask": 22,
  "mouth": 23,
  "mouthbag": 24,
  "neck": 25,
  "nose": 26,
  "nose_outer": 27,
  "nostrils": 28,
  "shoulders": 29,
  "smile_line": 30,
  "teeth": 31,
  "temples": 32,
  "tongue": 33,
  "undereye": 34
}
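
A minimal sketch for turning the segmentation image into boolean masks using this mapping; the 11.segments.png file name, the single-channel integer-label format, and the use of Pillow/NumPy are assumptions here, and the face_api_dataset library provides higher-level helpers:

import numpy as np
from PIL import Image

segments = np.array(Image.open("11.segments.png"))  # one integer label per pixel
mapping = info["segments_mapping"]                   # from the matching 11.info.json

# Boolean mask of all hair pixels; combine several labels for larger regions
hair_mask = segments == mapping["hair"]
face_like = np.isin(segments, [mapping[k] for k in
                               ("cheek_left", "cheek_right", "forehead", "chin", "nose")])
print(hair_mask.sum(), face_like.sum())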

EXR File

The EXR has many channels, which are also automatically extracted into separate asset files that can be downloaded as well. The channels are:

  • RGBA: Beauty pass with alpha.

  • Z: Z depth. The range is zero to 100 meters: black = 0 meters, white = 100 (or infinite) meters. The Z-depth reference range is also included in the EXR header (‘zmin’, ‘zmax’) and in the info.json under ‘render’ (‘zmin’ and ‘zmax’ values).

  • CameraNormals.RGB: The surface normal at the point being sampled, in camera coordinate space. The "surface normal" is a vector that points directly away from the surface (at right angles to it) at that point.

  • CameraPoints.RGB: The position of the point being sampled, in camera coordinate space.

  • defocusAmount: The defocusAmount (Defocus Amount) render element is non-black only when depth of field and motion blur are enabled, and contains the estimated pixel blurring in screen space. Required pass when denoising is enabled.

  • denoiser.RGB: The VRayDenoiser render element, when generated, contains the final image that results from noise removal. Similar to effectsResults, but without lens effects added. Required pass when denoising is enabled.

  • diffuse.RGB: The Diffuse render element shows the colors and textures used in materials, flatly applied to objects with no lighting information.

  • diffuse_hair.RGB: Diffuse render element for hair elements.

  • effectsResults.RGB: Denoised beauty pass. Required pass when denoising is enabled.

  • noiseLevel: The noiseLevel (Noise Level) render element is the amount of noise for each pixel in greyscale values, as estimated by the V-Ray image sampler. Required pass when denoising is enabled.

  • normals.XYZ: The Normals render element creates a normals image from surface normals in the scene. It stores the camera-space normal map using the geometry's surface normals.

  • rawDiffuseFilter.RGB: Similar to the Diffuse render element, except it is not affected by Fresnel falloff. The result is a solid mask showing the pure diffuse color as set in the V-Ray material's settings.

  • reflectionFilter.RGB: Stores reflection information calculated from the materials' reflection values in the scene. Surfaces with no reflection values set in their materials contain no information in this render pass and therefore render as black.

  • refractionFilter.RGB: Stores refraction information calculated from the materials' refraction values in the scene. Materials with no refraction values appear as black, while refractive materials appear as white (maximum refraction) or gray (lesser amounts of refraction).

  • worldNormals.XYZ: The Normals render element creates a normals image from surface normals in the scene. It stores the world-space normal map using the geometry's surface normals.

  • worldPositions.XYZ: Stores the world-space surface positions.

  • surface: Mask of all surfaces in the scene excluding skin. Combined with the surface.skin mask below, it covers everything in the scene.

  • surface.skin: Masks objects labeled as “skin”.

  • segmentation: Empty pass for creation.

  • segmentation.mustache: Mask for the mustache only; does not include beard portions.

  • segmentation.hair: Mask for head hair only; does not include other hair such as the mustache, beard, eyebrows, etc.

  • Other segmentation.* channels: One mask per labeled region: jaw, eyelid, neck, eye_right, glasses_lens, jowl, mouth, clothing, eyes, ear_left, teeth, cheek_right, cheek_left, eyelashes, lip_upper, chin, shoulders, nose_outer, head, eye_left, beard, mouthbag, glasses_frame, lip_lower, tongue, brow, undereye, ear_right, nose, smile_line, temples, nostrils, forehead, mask.

  • binary_segmentation: Empty pass for creation.

  • binary_segmentation.foreground: Masks objects labeled as “foreground”. Replaces Alpha as a way to separate the character from the background.
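
Finally, a sketch of inspecting these channels directly with the OpenEXR Python bindings (the 188.exr file name follows the metadata example above); the pre-extracted PNG/TIF assets are usually more convenient:

import Imath
import OpenEXR
import numpy as np

exr = OpenEXR.InputFile("188.exr")
header = exr.header()
print(sorted(header["channels"].keys()))  # e.g. A, B, G, R, Z, segmentation.*, ...

# Read the Z (depth) channel into a float32 array
data_window = header["dataWindow"]
width = data_window.max.x - data_window.min.x + 1
height = data_window.max.y - data_window.min.y + 1
pixel_type = Imath.PixelType(Imath.PixelType.FLOAT)
z = np.frombuffer(exr.channel("Z", pixel_type), dtype=np.float32).reshape(height, width)
print(z.min(), z.max())  # compare with the zmin/zmax values in the header or info.json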