Detecting Deep Fakes with Camera Root of Trust

A slow and expensive way to solve the Deep Fakes problem with camera root-of-trust

I wrote this a few years ago and shopped it around to a few orgs. No one was interested then, but deep fakes continue to be a big issue. I’m posting it here in case it inspires anyone.

Recent improvements in Deep Learning models have brought the creation of undetectable video manipulation within reach of low-resource groups and individuals. Several videos have been released showing historical figures giving speeches that they never made, and it is very difficult for a human viewer to detect that these videos are fabricated. These Deep Fake videos and images could harm democracy and the economy by degrading public trust in reporting and making it more difficult to verify recorded events. Moreover, Deep Fake videos could undermine law enforcement, defense, and intelligence operations if adversarial actors feed fabricated footage to these organizations.

These problems will only grow as Deep Learning models improve, and it is imperative that a solution be found that enables trusted video recording. One naïve way of detecting Deep Fakes would be to fight fire with fire and create Deep Learning models that can detect manipulated video and images. This mitigation is likely to lead to a Red Queen race in which new defensive models are constantly superseded by new offensive models. To avoid this, I recommend the adoption of a hardware root-of-trust for modern cameras. Combined with a camera ID database and automated validation software plug-ins, a hardware root-of-trust would allow video producers to provably authenticate their footage and video consumers to detect alterations in the videos they watch.

The proposed hardware system will allow for the production of provably authentic video, but it will not allow old video to be authenticated. It provides a positive signal of authenticity, but videos and images that do not use this system remain suspect.

In order to minimize the risk of Deep Fakes, a system similar to the one proposed here must be driven to near ubiquity. Once the system is ubiquitous, videos produced without it will be viewed with suspicion, making it harder to fool viewers with Deep Fakes. If, on the other hand, many videos continue to be produced without such a system, then the absence of its positive signal will not inspire enough distrust in viewers, and Deep Fakes may still proliferate.

Hardware Root-of-Trust for Video

In order to trust an image or video, a viewer needs to be sure that:

  1. The video was created by a real camera
  2. The video was not tampered with between creation and viewing

These two goals can be accomplished by the use of public key cryptography, fast hashing algorithms, and cryptographic signing.

I propose the integration of a cryptographic coprocessor on the same silicon die as a camera sensor. Every frame produced by the sensor would then include metadata that proves the frame came from that specific camera. Any change to the pixels or metadata will invalidate the frame’s signature and therefore be detectable.

Secure Frame Generation

When the camera sensor is manufactured, the manufacturer will trigger the generation of a public/private key pair for that sensor. The sensor itself will generate the keys and will store the private key in a block of memory that cannot be read off-chip; only the public key will ever be available outside the sensor. The sensor will also have a hardware-defined (set in silicon) universally unique identifier. The sensor manufacturer will then store the camera’s public key and sensor ID in a database accessible to its customers.
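As a concrete illustration, the provisioning step could look like the Python sketch below, assuming Ed25519 keys generated with the `cryptography` package. The algorithm choice, the `provision_sensor` helper, and the database record format are illustrative assumptions, not part of the proposal.

```python
# Sketch of sensor provisioning, assuming Ed25519 keys (illustrative choice).
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def provision_sensor(sensor_id: str) -> dict:
    # On real hardware this runs inside the sensor die; the private key
    # never leaves the chip. It is held in memory here only for illustration.
    private_key = Ed25519PrivateKey.generate()
    public_key_bytes = private_key.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )
    # Only the public half is exported to the manufacturer's database.
    return {"sensor_id": sensor_id, "public_key": public_key_bytes.hex()}
```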

No changes need to be made to the hardware design or assembly process of devices that incorporate imaging hardware. Secure sensors can be included in any device that already uses such sensors, like smartphones, tablets, and laptops.

The consumer experience of creating an image or video is also unchanged. Users will be able to take photos and video using any standard app or program. All of the secure operations will happen invisibly to the user.

Whenever a secure image is taken using the sensor, the sensor will output the image data itself along with a small metadata payload. The metadata will be composed of the following:

  1. The ID of the camera
  2. The plaintext hash of the prior secure frame grabbed by this camera sensor
  3. A cryptographically signed package of:
    • The hash of the current secure image frame
    • The hash of the prior secure image frame

The inclusion of the hash of the prior frame allows users to verify that no frames were inserted, removed, or reordered within a video. When the image or video is displayed in a normal viewer, the metadata will not be observable.
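A minimal sketch of how the on-die engine could build this metadata payload, assuming SHA-256 frame hashes and Ed25519 signatures (both illustrative choices; the `sign_frame` helper stands in for the sensor’s cryptographic coprocessor):

```python
# Sketch of per-frame metadata generation, assuming SHA-256 + Ed25519.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sign_frame(private_key: Ed25519PrivateKey, sensor_id: str,
               frame_bytes: bytes, prior_hash: bytes) -> dict:
    current_hash = hashlib.sha256(frame_bytes).digest()
    # The signature covers both the current and the prior frame hash,
    # chaining each frame to the one that preceded it.
    signature = private_key.sign(current_hash + prior_hash)
    return {
        "sensor_id": sensor_id,          # 1. the ID of the camera
        "prior_hash": prior_hash.hex(),  # 2. plaintext hash of the prior frame
        "signature": signature.hex(),    # 3. signature over (current, prior) hashes
    }
```

A validator recomputes the current frame’s hash from the pixels themselves, so it does not need to be carried in plaintext.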

Secure Frame Validation

Any user who wishes to validate a video or image will need to run the following procedure (which can be automated and run in the background by, e.g., a browser plug-in):

  1. Read the metadata of the first frame
  2. Look up the public key of the image sensor by using its ID (from the metadata) with the sensor manufacturer’s database
  3. Hash the first frame
  4. Compare the hash of the first frame to the signed hash included within the frame’s metadata (requires public key of the sensor)
  5. In a video, subsequent frames can be validated in the same way. Frame continuity can be validated by comparing the signed prior-hash in a frame’s metadata with the calculated hash of the prior frame in the video.

Viewers can be certain that the image or video is authentic if the following criteria are met:

  1. The sensor ID is the same in all frames
  2. The signed image hash matches the calculated image hash for all frames
  3. The signed hash was created using the private key that corresponds to the public key retrieved using the sensor’s ID
  4. Each frame’s signed prior hash matches the hash from the prior frame in the video (not necessary for single images or the first frame in a video)

If any of the above criteria fail, then the viewer will know that the image or video was tampered with.
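The procedure and criteria above could be automated roughly as follows, reusing the same illustrative SHA-256/Ed25519 assumptions and a hypothetical `lookup_public_key` call against the manufacturer’s database:

```python
# Sketch of the validation loop; the criteria numbers refer to the list above.
import hashlib
from typing import Callable
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def validate_video(frames: list[bytes], metadata: list[dict],
                   lookup_public_key: Callable[[str], Ed25519PublicKey]) -> bool:
    sensor_id = metadata[0]["sensor_id"]
    public_key = lookup_public_key(sensor_id)
    prior_hash = None
    for frame, meta in zip(frames, metadata):
        # Criterion 1: every frame must report the same sensor ID.
        if meta["sensor_id"] != sensor_id:
            return False
        current_hash = hashlib.sha256(frame).digest()
        claimed_prior = bytes.fromhex(meta["prior_hash"])
        # Criterion 4: the claimed prior hash must match the hash of the
        # frame that actually preceded this one (skipped for the first frame).
        if prior_hash is not None and claimed_prior != prior_hash:
            return False
        try:
            # Criteria 2 and 3: the signature over (current, prior) hashes
            # must verify under the public key retrieved by sensor ID.
            public_key.verify(bytes.fromhex(meta["signature"]),
                              current_hash + claimed_prior)
        except InvalidSignature:
            return False
        prior_hash = current_hash
    return True
```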

Implementation Plan

Prototype Hardware

The system described above can be prototyped using an FPGA and an off-the-shelf camera sensor. A development board can be created that connects the camera’s MIPI CSI interface directly to the FPGA. The FPGA will be configured to implement the cryptographic hashing and signing algorithms. It will then transmit the image and metadata over a second MIPI CSI interface to the device processor. In effect, the prototype will have an FPGA acting as a man-in-the-middle that hashes and signs all images.

The FPGA will be configured with a cryptographic coprocessor IP core. In addition to the hashing and signing algorithms, the core will also handle the following command and control functions:

  1. Generate a new public/private key pair
  2. Divulge public key
  3. Lock device (prevent re-generation of public/private keys)
  4. Invalidate (delete public/private key pair and lock device)
  5. Query device ID
  6. Set device ID (for FPGA prototype only; actual hardware will have ID defined at fabrication time)
  7. Enable authenticable frames (hash upcoming frames)
  8. Disable authenticable frames (stop hashing upcoming frames)

The IP core on the FPGA would use an I2C communication interface, the same as the control interface for most CMOS camera sensors. Two options exist for communicating with the FPGA (a host-side control sketch for the first option follows this list):

  1. The FPGA is a second device on the I2C bus with its own address. The application processor would have to know about it and use it explicitly.
  2. The FPGA acts as an I2C intermediary. The application processor would talk to the FPGA assuming that it was the camera IC, and any non-cryptographic commands would be forwarded to the camera itself. This method is more similar to the final hardware, in which the crypto engine is embedded on the same die as the camera sensor.
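For the first option, host-side control from the application processor might look like this Python sketch using the smbus2 package. The I2C address, register offsets, and command codes are placeholders invented for illustration; real values would come from the IP core’s register map.

```python
# Hypothetical host-side control of the FPGA crypto core over I2C (option 1).
from smbus2 import SMBus

CRYPTO_I2C_ADDR = 0x42   # placeholder address for the FPGA crypto core
REG_COMMAND     = 0x00   # placeholder command register
CMD_GEN_KEYS    = 0x01   # generate a new public/private key pair
CMD_LOCK        = 0x03   # lock device (prevent key re-generation)
CMD_ENABLE_AUTH = 0x07   # enable authenticable frames
REG_PUBKEY      = 0x10   # placeholder base register for reading the public key

def enable_secure_capture(bus_num: int = 1) -> bytes:
    with SMBus(bus_num) as bus:
        bus.write_byte_data(CRYPTO_I2C_ADDR, REG_COMMAND, CMD_GEN_KEYS)
        bus.write_byte_data(CRYPTO_I2C_ADDR, REG_COMMAND, CMD_LOCK)
        bus.write_byte_data(CRYPTO_I2C_ADDR, REG_COMMAND, CMD_ENABLE_AUTH)
        # Divulge public key: read the 32-byte public key back from the core.
        pubkey = bytes(bus.read_i2c_block_data(CRYPTO_I2C_ADDR, REG_PUBKEY, 32))
    return pubkey
```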

Validation Tools

The validation tools can be separated into the server-based camera lookup database and client-based video analysis software. The video analysis software can be written as a library or plug-in and released publicly, allowing the creation of codecs and apps for commonly used software.

During the prototyping and proof of concept, these libraries can be created and several test plug-ins written for video players. This will then serve as a useful base for the productization phase of the project.

Productization

While the FPGA-based prototype described above serves as a useful proof-of-concept, the end-product will need the cryptography engine to be located on the same die (or at least the same IC) as the camera sensor. This ensures that the images can’t be tampered with between the CMOS sensor itself and the cryptographic signing operation.

I propose that the cryptographic engine IP used on the FPGA be open-sourced and that a consortium be formed between camera sensor manufacturers (e.g. Omnivision) and integrators (e.g. Amazon). The consortium can be used to drive adoption of the system and to refine the standards.

Initially, cameras that include this security system may be more expensive than cameras that do not. I propose creating a higher-tier product category for secure image sensors. These can be marketed to government, intelligence, and reporting organizations. As old product lines are retired and new ones come online, manufacturers can phase out image sensors that do not include this secure system.

Funding Progression

A small team implementing the initial prototype hardware could be funded by contracts from organizations that stand to benefit the most from such hardware, such as DARPA. If the prototyping team were a small company, it could potentially find SBIR grants that would suffice.

A large company, such as Amazon, may wish to invest in secure camera systems to improve their security camera offerings. Stakeholders in this plan would also benefit from the positive press that would result from fighting Deep Fakes.

After the initial proof-of-concept and IP development, the cryptographic engine must be integrated into image sensor ICs. The large investment required for this could come from secure camera manufacturers directly, or from potential customers for such a system.

After the first round of authenticable image sensors is available in the market, expansion will be fundable by reaching out to customers such as news organizations, human rights organizations, etc.

Open Questions

  • Data format conversion and compression are very common with video. How can a signed video be compressed while maintaining authenticability?
  • Do there exist cryptographically compatible compression algorithms?
  • Can we create a cryptographically compatible compressed video format?

Risks

  1. Attackers may attempt to hack the central database of camera IDs and public keys. If successful, this would allow them to present a fake video as credibly real. It would also allow them to cause real videos to fail validation.
  2. Attackers may attempt to hack the validation plug-ins directly, perhaps inserting functionality that would lead to incorrect validation results for videos.
  3. Provably authentic video could lead to more severe blackmail methods.
  4. If this system does not achieve high penetration, then Deep Fakes that claim to come from non-secure camera sensors could still proliferate. Only by achieving high market penetration will the return on investment of producing Deep Fakes fall.
  5. A sufficiently dedicated attacker could create a Deep Fake video, acquire a secure camera, and then re-record the Deep Fake by pointing the camera at a high-definition screen playing it. This would produce a signed Deep Fake video. To mitigate this issue, high-security organizations (e.g. the defense or intelligence communities) are encouraged to keep blacklists and whitelists of specific camera sensors.
    • This risk can be mitigated in several ways. The most straightforward may be to include an accelerometer on the camera die as well. Accelerometer data could be signed and included in image frame metadata. Analysis could then be performed to correlate accelerometer data with video frames to ensure that their motion estimates agree.
  6. Privacy and anonymity may be threatened if the public keys of an image can be identified as belonging to a specific camera or phone owned by a specific person. Ideally, the databases published by the camera manufacturers would include only public keys and device serial numbers. Consumer device manufacturers would then be advised to randomize which imager goes into which device, so that it is harder to identify a photographer from their image. Additionally, zero-knowledge proof techniques should be investigated to improve privacy and anonymity while maintaining the ability to verify images.