Media Asset Management
Identification, Filtering and Recognition of Content Usage

Challenges of vast, multidimensional content growth 


The vast, multidimensional growth of content on all kinds of media brings new challenges for almost all market participants. Originators and owners of content want to know where and by whom their content is used. Distributors of content want to make sure that they and their clients stay compliant with the usage rights. Users of content want smart and straightforward methods to find content and to combine content from several sources, in a fully legally compliant way, into their own new product (e.g. in news).
This post focuses mainly on the technical side of this challenge and touches the legal side only as far as technical requirements originate from it.

It goes through several methods of identifying content via its video/image and audio portions and highlights the advantages of certain available technologies such as watermarking and fingerprinting, outlining the strengths and weaknesses of the specific methods for certain tasks.

It also dives into the area of “Hybrid AI” for the identification of content on the Internet, where neither classical AI methods nor classical model-based methods (like fingerprinting) are sufficient on their own, but only an adequate combination of both.

                    Definitions, Terminology, State of the Art


The following standard terms are used:
        - Content: Essence and Metadata
                    o Essence: Video, Audio or Data
                    o Metadata: Data description
        - Asset: Content and Rights

                    Media Content - Essence and Metadata

For different usages within the system, the video and audio data are encoded in different levels of quality and formats.

The details of this coding are not the subject of this article; they are under permanent further development, and there are also worldwide differences with regard to region-specific preferences.

Modern fingerprinting and watermarking methods must be able to handle all formats used worldwide, and almost all of them fulfill this requirement. Furthermore, companies must always be in a position to react promptly to changes in standards; otherwise their software solution would become obsolete for customers.

With the H.264/HEVC methods (adaptive block length) and all their derivatives commonly used today, the originally transmitted files are adapted to the target end devices on a Network Abstraction Layer (NAL). There are already other reasons in the nature of video and audio files why hash-based methods are not adequate; the target-device-dependent methods added one more: the structure of the files, as well as their binary content, differs for each type of target device.
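The device-dependent byte layout is exactly why cryptographic hashes fail where perceptual methods succeed. The following sketch illustrates the difference; the toy 8x8 "frame" and the deliberately minimal average hash are illustrative assumptions, not a production algorithm:

```python
import hashlib

# Toy 8x8 grayscale "frame" as a flat list of luma values (an assumption
# for illustration; real frames are full-resolution YUV data).
frame = [16 * (i % 13) for i in range(64)]

# The same frame after re-encoding for another target device: tiny value
# shifts, but perceptually identical content.
reencoded = [v + 1 for v in frame]

def byte_hash(pixels):
    """Cryptographic hash over the raw bytes: breaks on any change."""
    return hashlib.sha256(bytes(pixels)).hexdigest()

def average_hash(pixels):
    """Minimal perceptual hash: one bit per pixel, set if the pixel is
    brighter than the frame's mean. Robust against small value shifts."""
    mean = sum(pixels) / len(pixels)
    return tuple(1 if v > mean else 0 for v in pixels)

print(byte_hash(frame) == byte_hash(reencoded))        # False
print(average_hash(frame) == average_hash(reencoded))  # True
```

Real fingerprints are far more elaborate, but the principle is the same: they characterize what the viewer sees, not how the bytes happen to be arranged for one target device.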

In addition, efficient content management poses relevant challenges for television broadcasters and other rights owners, for which e.g. fingerprinting methods offer solutions.

Before broadcast, the original source material is often re-cut, distributed and, if necessary, recoded several times. If the material has metadata at all, it is usually limited to the title and, in the case of series, possibly the episode subtitle, e.g.: "Edgar Wallace"; "The Red Archer", or for football: "Germany-France"; ">>Date<<".

If re-cut material is generated from the above, the information about who created what and where, and which content portions were used for it, is lost with classical methods. This means that at such system boundaries, an almost complete loss of the rights information occurs without further measures.

Although this information can be restored manually, this is often omitted or carried out insufficiently, as it involves a great deal of effort. With the help of sophisticated fingerprinting methods, the assignment of rights can now be restored fully automatically, simply and efficiently, in large archives, even in re-cut third-party material. The examples below discuss this in more detail.

                     Media Assets - Content and Rights

Efficient and credible rights management is a key component for the future of digital media content archives. Due to the highly interconnected market for digital content, knowledge about multiple usages in different formats and the tracking of the corresponding rights is very relevant for producers and providers. Digital rights pose complex challenges for the integration into media archives and are a major organizational task for the correct integration into current workflows.

                    Media Asset Management System

The integration of a media asset management system into the entire workflow is important for the avoidance of media disruptions. One of the most important tasks of a content management system in general is the retrieval and reuse of archived material. This helps reduce the costs of new productions.

Depending on the usage of a media archive, the individual workflow differs. A standardized "master workflow" can no longer be easily defined today, as users pursue a wide variety of objectives.

However, the basic steps remain the same:

        - Coding
        - Annotation (manual, semi-automatic, fully automatic)
        - Storing
        - If required, transfer to long-term archive according to ageing parameters
        - If necessary, recovery

These basic steps from the original source material to the final archiving occur with many complements in all archive solutions. A fingerprinting or watermarking solution must be integrated into these customer-specific workflows at a suitable position.
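The basic steps above can be sketched as a minimal ingest pipeline with fingerprinting integrated as a parallel path during import; all function bodies, names and the 90-day ageing threshold are illustrative assumptions, not any product's API:

```python
ONLINE, LONG_TERM = [], []
ARCHIVE_AGE_DAYS = 90  # assumed ageing parameter

def ingest(title, day):
    """Coding, annotation and storing, with fingerprinting as a
    parallel path during import (all steps stubbed for illustration)."""
    obj = {"title": title, "ingested_on": day,
           "encoded": f"proxy:{title}",          # 1. coding (stub)
           "fingerprint": hash(title),           # fingerprinting path
           "annotations": ["auto"]}              # 2. annotation (stub)
    ONLINE.append(obj)                           # 3. storing
    return obj

def age_out(today):
    """4. Transfer to the long-term archive according to ageing parameters."""
    for obj in list(ONLINE):
        if today - obj["ingested_on"] > ARCHIVE_AGE_DAYS:
            ONLINE.remove(obj)
            LONG_TERM.append(obj)

ingest("Edgar Wallace - The Red Archer", day=0)
age_out(today=120)  # the object moves to the long-term archive
```

The only structural decision the sketch makes visible is the placement of the fingerprinting step at import time, so that every object entering the archive is identifiable from the start.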

In addition to archiving, the goal of media asset management is the efficient searching and browsing of reference content, its metadata and its corresponding rights. Typically, all reference content inserted into the media archive is linked to specific metadata. The basic procedure consists of assigning a set of additional information to the entire media object. This limits the search for specific scenes within a longer reference clip. To overcome this limitation, an additional indexing process is required. There are two strategies for indexing reference material, stratification and segmentation.

The process of segmentation is simple. The original reference material is automatically cut into smaller, independent media objects. This facilitates the assignment of different metadata to parts of the original clip and independent access to these sub-clips from the media archive. This procedure poses challenges regarding the granularity of the segments and the correct annotation with the corresponding metadata.

The second approach, stratification, consists of using the spatio-temporal properties of the reference clips. Stratification does not require cutting the material; instead it uses virtual sections explicitly defined and accessible by the corresponding timecode (each frame of the reference clip has a unique identifying timecode). Metadata can be assigned to a specific area in the timecode (temporal). This enables a completely unrestricted and overlapping assignment of different information to the reference clip. This process can be executed manually or supported by semi-automated processes (e.g. face recognition, pattern recognition, speech-to-text algorithms). In addition to temporal layering, it is even possible to spatially segment the frames and assign independent metadata.
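A minimal sketch of the stratification idea, with frame numbers standing in for timecodes and all annotations hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Stratum:
    """A virtual, possibly overlapping section of a reference clip,
    addressed by timecode (frame numbers here, for simplicity)."""
    tc_in: int
    tc_out: int
    metadata: str

# Overlapping strata on one clip: no physical cutting required.
strata = [
    Stratum(0, 5000, "speaker: anchorwoman"),
    Stratum(1200, 3400, "topic: election coverage"),
    Stratum(2000, 2600, "face: candidate X"),   # hypothetical annotation
]

def annotations_at(frame):
    """All metadata layers covering a given timecode."""
    return [s.metadata for s in strata if s.tc_in <= frame <= s.tc_out]

print(annotations_at(2300))
# -> ['speaker: anchorwoman', 'topic: election coverage', 'face: candidate X']
```

Because the strata are virtual, any number of them can overlap on the same frames, which is exactly what segmentation into physical sub-clips cannot offer.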

In the case of archives, a distinction must also be made between purely static, no longer changing parts and dynamic parts, which are permanently augmented and also changed in the area of the recent past, e.g. in the metadata part. With dynamic archives, the efficient allocation and creation of new entries becomes very important.

The retrievability of the material is the actual reason for the existence of the archive.
In addition to the obvious method of searching directly via textual annotations, video and audio-based search methods have also become established in parallel or in a complementary manner.
In both video and audio, a distinction must be made between fingerprint and watermark based methods.

The generally accepted terminology is used here: watermark methods are those in which an invisible or inaudible identification code is written into the material (a "watermark" in the literal sense), while fingerprint methods are those in which a characteristic, unique code is generated from the material itself using an algorithm (a "fingerprint" in the literal sense). This characteristic makes it possible to recognize the material without having to insert anything into it.

Today, modern fingerprinting methods recognize not only identical but also similar content if the parameters are set appropriately. The latter is, as explained in more detail in the following chapter, an important feature in some applications. (In this respect, the term "DNA", in analogy to criminalistics, would now be more appropriate than the established term "fingerprint", because these methods also find "relatives".)
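This "similar, not only identical" behavior can be illustrated with a toy bit-vector fingerprint and a Hamming-distance threshold; the 64-bit vector and the 6-bit threshold are illustrative assumptions, since real fingerprints and their distance measures are far more sophisticated:

```python
def hamming(a, b):
    """Number of differing bits between two equal-length fingerprints."""
    return sum(x != y for x, y in zip(a, b))

def is_match(a, b, max_bits=6):
    """Identical OR similar content, depending on the threshold parameter
    (the value 6 is an illustrative assumption)."""
    return hamming(a, b) <= max_bits

original = [1, 0, 1, 1, 0, 0, 1, 0] * 8   # 64-bit toy fingerprint
recoded = list(original)
recoded[3] ^= 1                           # a few bits flip after recoding
recoded[17] ^= 1

print(is_match(original, recoded))        # True: similar content is found
print(is_match(original, [0] * 64))       # False: unrelated content
```

Tightening the threshold restricts the search to near-identical copies; loosening it lets the system find the "relatives" mentioned above.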

For important copyright reasons, it should be mentioned that content can be unambiguously re-identified with the help of fingerprints. However, with all mature methods it is mathematically impossible to recreate the source material from the fingerprint (here in contrast to DNA, from which one can theoretically recreate the living being).

This characteristic implies significant legal latitude when transmitting fingerprints.

Figure 1: Workflow for fingerprinting, simple integration as serial or parallel path during import

     Comparison of different methods including an assessment 

Audio vs. video for watermarking and fingerprinting 

First, audio-based methods are compared with video-based ones.

Audio-based detection methods have been in use for a long time and are very mature, both in terms of their basic robustness and their efficiency.

Video-based methods have been gaining ground for only 10 to 15 years. While at the beginning they required a lot of computing power, this is no longer an issue: on the one hand the video algorithms have been considerably improved, and on the other hand computing performance has increased enormously. Audio still requires fewer resources, but the differences are economically negligible in the overall result. There are still reasons for audio-only solutions, for example when the pure audio channel carries the characteristics of the material, such as music or unchanged live interviews, or, trivially, when it comes to radio material or the like.

In view of the progress made with today's very robust video-based methods, however, it is more than questionable whether it makes sense to forgo the content of the video channel for identification purposes. After all, it is far richer in information. It can also bring clear efficiency advantages when expanding an archive. If, for example, new TV program segments and commercials are to be identified, annotated and archived throughout Europe, there are many repetitions due to the nature of this system. Because of the different languages in the audio part, a new contribution must be found and annotated "for the first time" at least once in each country. On the video track, however, it appears for the first time in one country and is then already found in the other countries.

In the case of news material, the video material is also often re-used and underlaid with completely new narration. Such cases rule out a priori the use of pure audio identification in the corresponding archive and in other solution scenarios.

In addition, modern video-based methods outperform audio-based methods in terms of system robustness. Audio techniques are as mature and robust today as they can be, but they have the implicit drawback that their channel carries considerably less information than video, and for the clear differentiation of two data objects the amount of information content is the decisive factor.

Nevertheless, it would be unwise to discard the information of the audio channel. Therefore, the most modern methods today use both video and audio and present the synthesis of both as their search result.

Fingerprinting vs. watermarking on the audio and video channel

In contrast to audio vs. video, the methods "fingerprinting" and "watermarking" are fundamentally different from each other in the way they are created and applied. Both are, as explained above, applied to audio and video content, but in completely different manifestations.
And both have their advantages and disadvantages, depending on the application.

The disadvantages of watermarking arise first of all from its more complex workflow. The watermark MUST be written into the material before it is sent out for the first time. If, for whatever reason, this does not happen, a copy that can never be identified again is in circulation, and copies of this copy can no longer be found either. When storing, the same effect occurs with the opposite sign: in watermark-based archives, a watermark must be written into the material and stored in the Watermark DB when a new content is found for the FIRST time.

With fingerprinting this is not essential, even if new content is only recognized as "new" on the 5th or 10th occurrence. If, for example, several countries in different time zones work on a central content DB, all past playouts are recognized and assigned automatically a posteriori without any problems.
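This a-posteriori assignment can be sketched as follows; the exact-match lookup is a deliberate simplification (real systems match by fingerprint similarity), and all names and timestamps are hypothetical:

```python
# Every playout's fingerprint is logged, even while the content is still
# unknown. When a reference is registered "for the first time", all past
# playouts are matched retroactively: nothing is lost if the content was
# unknown on first airing.
playout_log = []     # (station, timestamp, fingerprint)
reference_db = {}    # fingerprint -> title

def log_playout(station, ts, fp):
    playout_log.append((station, ts, fp))

def register_reference(fp, title):
    reference_db[fp] = title
    # retroactive assignment over the whole playout history
    return [(st, ts) for st, ts, f in playout_log if f == fp]

log_playout("FR-1", "2023-05-01T20:15", fp="abc123")
log_playout("DE-2", "2023-05-02T19:30", fp="abc123")
hits = register_reference("abc123", "Edgar Wallace - The Red Archer")
print(hits)   # both earlier playouts are found after the fact
```

A watermark-based system offers no equivalent of this retroactive step: an unmarked playout stays unidentifiable forever.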

In the past, watermark methods had the advantage of requiring fewer computing resources for recognition, but today this advantage is marginal in view of the development described above.

In practical use, the disadvantage of the more complex workflow has led many organizations to abandon watermarking: it proved difficult to really ensure that all new content had been correctly watermarked BEFORE it was first sent out.

However, there is one area where watermarking is still the only reasonable solution: when ONE particular copy must be identified, e.g. in a forensic copyright-infringement investigation where it is important to determine where the leak originated, as with cinema material or copies distributed via certain remarketing channels. Again, for clarification: the pure compliance violation, and where necessary the restoration of the rights information, can be recognized excellently by fingerprinting, even with strongly modified material composed from different sources. Traceability to a single, specially labelled copy, however, is not possible this way; the latter requires watermarking.

In the above cases, there is no reason why watermarking and fingerprinting should not be used in combination in order to exploit the respective advantages of both methods.

In more recent areas, on the other hand, there are significant advantages of fingerprinting:

"Images within an image" or split-screen blending can be detected. This is very important, for example but not only, for news programmes, sports programmes, etc..

Figure 2: Recognition of Image in Image

In the above examples one can clearly see how far the identification of deviations proceeds.

Figure 3: Recognition of an Image within a Split Screen

Not only are basic overlays harmless: even when the original material is inserted into a completely different frame image, the scene is correctly recognized.

Figure 4: Detection of Video Deviations with Video Fingerprinting

It often happens that the exact identification of the deviation is important, e.g. when a modified annotation is to be created. The above examples show how even the smallest changes are highlighted and thus brought to the operator's attention.

The procedures available today are, as the examples show by way of extracts, robust even after heavy editing in space and time and reliably recognize the similarity of the material.

A comparison can be carried out fully automatically, even against aged material, and it is possible to compare frame by frame if necessary.

Even shots of the same event from different camera positions can be detected. This sometimes plays a role when it comes to rights issues in sports.

Modern methods scale excellently because their design is inherently parallel. This means that more processors can be used if required, with almost linear scaling.
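The inherently parallel structure can be sketched as shard-wise matching with merged results. This toy uses Python threads purely to show the fan-out/merge shape (real CPU-bound matching would use processes or machines), and all names and the 6-bit threshold are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def match_shard(shard, query, max_bits=6):
    """Match the query fingerprint against one independent DB shard."""
    dist = lambda a, b: sum(x != y for x, y in zip(a, b))
    return [title for title, fp in shard if dist(fp, query) <= max_bits]

def parallel_search(shards, query, workers=4):
    """Fan the shards out to workers and merge the hits; with independent
    shards the work scales almost linearly with the worker count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda s: match_shard(s, query), shards)
    return [hit for part in results for hit in part]

fp_a, fp_b = [1, 0] * 32, [0, 1] * 32
shards = [[("clip-A", fp_a)], [("clip-B", fp_b)], [], []]
print(parallel_search(shards, query=fp_a))   # ['clip-A']
```

Because no shard depends on any other, adding workers (or machines) only requires re-partitioning the reference database.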

And in contrast to the more laborious watermarking, they can be easily integrated into any workflow and any IT infrastructure.


Every attempt at an assessment is subject to an unavoidable degree of subjectivity. Therefore, no attempt is made here to give the judgement an appearance of objectivity by casting what is described above into assessment tables, because the question of the weighting factors would immediately arise.

According to the author's assessment, a combination (synthesis) of classical text annotation in combination with fingerprint methods that use both video and audio information is currently the most suitable method for state-of-the-art archives.

If it becomes necessary to enable individual forensic identification, a watermarking method might be used in combination with the fingerprinting method for this specific case, but it will not replace the fingerprinting method.


With the further developments in the field of semantic methods and deep learning, leading technology providers in the field of video and audio techniques have long since incorporated these new techniques into their solutions, are about to introduce them to the market, or are significantly improving methods that have already been introduced.

The following is a direct glimpse into the laboratories of these software boutiques.

While the previous parts described how today's robust state-of-the-art technology can appear in the daily production environment of archive solutions, this section shows what becomes possible with more recent methods.

The most important new element is semantics in either modeled or AI (Artificial Intelligence) manifestation.
Semantics in the context of e.g. video identification means the automatic recognition of the meaningful context of a scene.

To make this clearer with an example from sports, see Figures 4 to 9: here different types of scenes from a football broadcast are automatically classified by a hybrid system in order to enable more specific archive queries and annotations of the archived material.

The semantic video recognition software automatically recognizes the scene types and uses this self-created information for further classification.

Figure 4, 5: Studio scenes are automatically recognized

Figure 6, 7: Interview with soccer players in front of wall is automatically clustered

Figure 8: Soccer goal scene is automatically detected

Figure 9: Playing field, players and perimeter recognized and differentiated by the system

Figure 9 shows how the system differentiates between the playing field, players and perimeter. After this step, the subsequent detection and counting of logos, for example, is possible more robustly (since the background is known), and the background can be attached to the recognized logo as metadata.

Hybrid means that these methods perform the initial, essential coarse identifications using classical video detection. Only when this "scaffold" or "skeleton" is in place, i.e. when it is clear in which rough direction the search is going, are AI methods added for fine adjustment. By combining both worlds, an extremely high accuracy rate of well over 90% can be achieved, which is what makes these methods suitable for production in the first place. Pure AI methods, by contrast, typically end up at over 60%, in favorable cases at around 80%. The hybrid approaches, on the other hand, are robust and reliable even in the most challenging cases.
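A schematic of the hybrid idea, under strong simplifying assumptions: a cheap classical feature (mean luma here, purely illustrative) builds the coarse "scaffold" of candidates, and an expensive fine scorer, standing in for the AI stage, runs only on that shortlist:

```python
def coarse_signature(frame):
    """Classical, model-based stage: a cheap global feature
    (mean luma of a toy frame) narrows down the candidates."""
    return sum(frame) / len(frame)

def fine_score(frame, candidate):
    """Stand-in for the AI stage: an expensive, detailed comparison run
    only on the shortlist (here: negative absolute pixel difference)."""
    return -sum(abs(a - b) for a, b in zip(frame, candidate["frame"]))

def hybrid_classify(frame, references, coarse_tolerance=10.0):
    sig = coarse_signature(frame)
    shortlist = [r for r in references
                 if abs(coarse_signature(r["frame"]) - sig) <= coarse_tolerance]
    if not shortlist:
        return None
    return max(shortlist, key=lambda r: fine_score(frame, r))["label"]

refs = [
    {"label": "studio scene", "frame": [50] * 16},
    {"label": "goal scene",   "frame": [200] * 16},
    {"label": "interview",    "frame": [55] * 16},
]
query = [54] * 16
print(hybrid_classify(query, refs))   # 'interview'
```

The point of the split is economic as well as qualitative: the expensive stage never sees the full reference set, and its answers are anchored by the robust classical scaffold.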

Due to the extremely large volume of data in the video and audio sector, the reliability of the solution is an important asset, because every error requires the intervention of an operator, which rapidly makes operating costs uneconomical. This automatic scene classification is becoming increasingly important for all archive applications, not only in sports, and is becoming highly relevant from an economic point of view.

For example, customers want to automatically count the visibility of logos in a broadcast. This also depends on the respective duration of the visibility and on the object on which the logo appears (see Figure 9), e.g. the perimeter advertising, a player's shirt, an advertisement wall, a logo billboard during a player interview, etc. Knowledge about the type of game scene also becomes more and more important, e.g. goal scene, midfield scene or penalty scene (see Figures 4 to 8). Depending on the scene, the TV viewer is more or less involved, which influences his capacity to absorb advertising. With the penetration of digital, dynamic perimeter advertising, it is therefore essential which advertising appears in which scene.
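Such a weighted visibility count can be sketched as a simple aggregation; the detections, surfaces and scene weights below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical detections: (logo, surface, scene type, seconds visible).
detections = [
    ("BrandX", "perimeter", "goal scene",     4.0),
    ("BrandX", "shirt",     "midfield scene", 12.5),
    ("BrandX", "perimeter", "midfield scene", 30.0),
    ("BrandY", "billboard", "interview",      8.0),
]

# Assumed attention weights per scene type: goal scenes involve the
# viewer far more than midfield play.
SCENE_WEIGHT = {"goal scene": 3.0, "interview": 1.5, "midfield scene": 1.0}

# Weighted exposure per (logo, surface) pair.
exposure = defaultdict(float)
for logo, surface, scene, seconds in detections:
    exposure[(logo, surface)] += seconds * SCENE_WEIGHT[scene]

for key, value in sorted(exposure.items()):
    print(key, round(value, 1))
```

The scene classification from the previous figures is what supplies the scene type and surface labels; without it, only the raw duration column would be available.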

Since consumers are now also addressed directly with personalized advertising on smartphones thanks to other techniques, this detailed information gains great statistical significance in operational campaign management, but also in classic advertising-impact analysis, because these deeper insights into the data allow finer-grained predictions.

Further use cases are the generation of annotation suggestions on the basis of semantic analysis of both the audio and the video track. In the audio field, semantic speech-recognition methods achieve much more accurate results than purely word-based methods. In view of the amount of material to be archived, any increase in efficiency that reduces operator workload is again very important.

The further development remains exciting. Due to the newer approaches described, it is to be expected that the functional power, and thus the commercial value, of archives will increase significantly.

ivitec GmbH

Lange Reihe 29
20099 Hamburg

+49 6151 60 60 789