Getting Started with Kinesis Content Event Stream
A much simpler alternative to Kinesis Streams... IFX
Kinesis Streams can be difficult to use: they have a high learning curve and require a stateful application. Each user action (for instance, updating a story) emits dozens of messages, and your application must parse each message to determine which criteria to look for. Building an application for this can take months.
A much simpler alternative is IFX's Content Events, which emits a single event for each user action. Customers are able to create an integration and get it into production in far less time. You can learn more about IFX here.
Background
The Content API generates a real-time stream of data as changes arrive. This stream reflects changes to both published and unpublished content anywhere in the document body. The data from the stream can be used to synchronize an external CMS with Arc XP, to extract the current state of certain content within Arc XP as it changes, or to perform limited real-time analytics on publishing changes.
Getting Access
The Content Kinesis Stream can be made available to your AWS account upon request. This capability only supports AWS consumers. Contact Arc XP Customer Support to get the setup process started.
Retrieving the data from the streams
A full explanation of consuming data from Kinesis is available from the AWS Kinesis Documentation.
Kinesis streams consist of shards and records. Records are ordered within a single shard. At a high level, your application must consume from each shard in sequence to retrieve new records as they arrive and process them. This document will refer to this part of your application as your consumer.
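If you are writing your own consumer in Python, a minimal polling sketch using boto3 might look like the following. The stream name, region, and the handle function are placeholders (Arc XP support provides the actual stream details for your account), and a production consumer would more likely use the Kinesis Client Library, which manages shard discovery, parallelism, and checkpointing for you:

```python
import time

import boto3

# Placeholder values; use the stream name and region provisioned for your account.
STREAM_NAME = "arc-content-stream"
kinesis = boto3.client("kinesis", region_name="us-east-1")

def consume_shard(shard_id):
    """Poll a single shard forever. A real consumer runs one of these per
    shard (in parallel) and checkpoints sequence numbers so it can resume."""
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM_NAME,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # or "LATEST" to read only new records
    )["ShardIterator"]
    while iterator:
        response = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            handle(record)  # hypothetical callback; see the next section
        iterator = response.get("NextShardIterator")
        time.sleep(1)  # stay under the per-shard read limits

shard_ids = [s["ShardId"] for s in kinesis.list_shards(StreamName=STREAM_NAME)["Shards"]]
```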
Retrieving the Arc payload
Kinesis limits the size of a single record to 1 MB. However, the Content API places messages on the stream that contain the entire state of the document at the time of the update. Since these documents are often more than 1 MB in size, Content API will place some documents into S3 instead of in the record directly. In these cases, the record will contain a pre-signed S3 URL to retrieve the full payload.
Note
To ensure your consumer is coded to handle this case, a random sample of updates is saved to S3 regardless of payload size.
Each Kinesis record will contain either the Content API payload itself (JSON) or a pre-signed S3 URL to retrieve the payload. In either case, the payload will be gzipped, so your consumer must retrieve and decompress it.
Your business logic for retrieving the payload should look something like this:

```python
import zlib

for record in records['Records']:
    data = record['Data']
    # Initial payload will be compressed
    payload = zlib.decompress(data, 15 + 32).decode("utf8")
    body = None
```
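The decompressed payload is either the Content API document itself or, for oversized documents, the pre-signed S3 URL described above. As a rough sketch of filling in the body, assuming the S3 case can be recognized by the payload being a URL string rather than a JSON document (the sample consumer linked below shows the exact contract):

```python
import json
import urllib.request
import zlib

def resolve_payload(payload):
    # Hypothetical check: a payload that looks like a URL is the pre-signed
    # S3 pointer for an oversized document; anything else is the document.
    if payload.startswith("https://"):
        with urllib.request.urlopen(payload) as response:
            # Per the note above, the S3 payload is also gzipped.
            return json.loads(zlib.decompress(response.read(), 15 + 32).decode("utf8"))
    return json.loads(payload)
```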
See also the sample consumer code on GitHub.
Content Operations
Once you've begun retrieving payloads, you'll notice that they all have a similar format, described by the Content Operation Schema.
Let's dive into the data available here.
“type”: “content-operation”
This should always be present. If it's not present, you are parsing the wrong document.
“organization_id”: “the-herald”
This will always be the subdomain of arcpublishing.com that you use to access Arc.
“operation”: “insert-story”
The “operation” identifies what changed in the Content API.
The Content API stores metadata for four document types (story, gallery, video, redirect). The data for a single document of one of these types can only be replaced (`insert`) or deleted (`delete`).
For the purposes of Content API, an update event is the same as an insert event, so both will show in the Kinesis stream as `insert`.
“date”: “2018-01-01T12:00:00.0+00:00”
The RFC3339-formatted date describing when this operation was completed in the Content API.
“id”: “ABCDEFGHIJKLMNOPQRSTUVWXYZ”, “branch”: “default”, “published”: true
The fields `id`, `branch`, and `published` collectively form the document key.
At most one document may exist in Content API at any time for each distinct document key. For each id + branch combination, a published and a non-published version of the document may co-exist simultaneously. In this way, you can see real-time updates from *both* the draft and published document versions. However, this can also lead to some confusion when processing updates. See the Concerns section below.
“created”: false
On insert operations, this will be set to `true` if a new document was inserted (that is, if no document with the same document key existed in Content API at the beginning of the operation).
“body”: ...
The updated version of the affected document.
“trigger”: {...}
The `trigger` object contains metadata about the input event that caused the change in the document.
This is useful for distinguishing downstream updates from source updates. If the id and type of the affected document are identical to the id and type of the trigger document, then the update was generated by a user editing the document directly. But if the trigger document fields are different (for example, the trigger has “type”: “image” and “id”: “DEF” and the affected document has “type”: “story” and “id”: “ABC”), then a user updated the trigger document, and this update in turn caused the affected document to update; see the sketch after the field list below. The trigger may contain the following five fields.
“trigger.type”: “image”
The document type that a user altered which triggered the enclosed document to change.
“trigger.id”: “DEF”
The id of the document that was modified by a user.
“trigger.referent_update”: true
If this update was triggered indirectly, this will be true.
“trigger.priority”: “ingestion”
May be `ingestion` or `standard`. Updates with `ingestion` priority may be pending in the processing queue for longer.
“trigger.app_name”: “composer”
The app where the triggering user action took place, if available.
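As a rough illustration of the source-versus-downstream check described above, a consumer might use something like this sketch. It assumes the affected document's type can be derived from the operation name (for example, insert-story becomes story):

```python
def is_direct_edit(message):
    # Assumption: the affected document's type is encoded in the operation
    # name, e.g. "insert-story" -> "story".
    doc_type = message["operation"].split("-", 1)[1]
    trigger = message.get("trigger") or {}
    # A direct edit: the trigger document is the affected document itself.
    return trigger.get("id") == message["id"] and trigger.get("type") == doc_type
```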
Concerns
Tracking the right document
One key thing to be aware of when consuming updates is the structure of the *document key*, described above. Each document in Content API is uniquely identified by id + branch + published.
This means that processing updates to draft and published documents may be unintuitive. For example, suppose your consumer receives updates in sequence with the following values:
```json
{ "id": "ABC", "branch": "default", "published": false, "operation": "insert-story" }
{ "id": "ABC", "branch": "default", "published": true, "operation": "insert-story" }
{ "id": "ABC", "branch": "default", "published": true, "operation": "insert-story" }
{ "id": "ABC", "branch": "default", "published": false, "operation": "insert-story" }
{ "id": "ABC", "branch": "default", "published": true, "operation": "insert-story" }
```
At first glance, it may appear that document “ABC” was saved, then published, published again, then unpublished, and then republished. However, that is incorrect.
In fact, what this sequence represents is a series of updates to two copies of the document. In this example, the draft (non-published) document was updated, followed by two updates to the published document, then another edit to the draft version, and finally, another update to the published document.
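In practical terms, any state your consumer keeps should be keyed by all three fields of the document key. A minimal sketch (the names here are illustrative, not part of the API):

```python
# State keyed by the full document key (id, branch, published), so updates
# to the draft and published copies never overwrite each other.
latest_documents = {}

def track(message):
    key = (message["id"], message["branch"], message["published"])
    if message["operation"].startswith("insert"):
        latest_documents[key] = message["body"]
    elif message["operation"].startswith("delete"):
        latest_documents.pop(key, None)
```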
How can I detect a publish state change?
The surest way is to have your consumer track state internally and update when it sees an insert or a delete for the document in question. However, this requires statefulness on the application side, which is often undesirable.
Often, organizations want to know when a piece of content is first published, so they can send a notification to subscribers. The best way to accomplish this is to look for events that meet these criteria:
- `published: true`
- `canonical_url` is not empty (or `websites.<WEBSITE>.website_url` is not empty)
- `first_publish_date` = `publish_date` (this will only be true on first publish in most cases)
Later publish events can use only the first two criteria, as `publish_date` is typically updated on publishes after the first.
For the unpublish event, you can simply look for “operation” = “delete-story”.
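Putting these criteria together, a first-publish check might look like the following sketch. It assumes the ANS fields (canonical_url, first_publish_date, publish_date) appear in the message's body, as in the payloads described earlier:

```python
def is_first_publish(message):
    body = message.get("body") or {}
    return (
        message.get("operation") == "insert-story"
        and message.get("published") is True
        and bool(body.get("canonical_url"))
        and body.get("first_publish_date") == body.get("publish_date")
    )

def is_unpublish(message):
    # Per the guidance above; depending on your workflow you may also want
    # to confirm the event refers to the published copy (published: true).
    return message.get("operation") == "delete-story"
```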
Why are there sometimes two events after publish?
If a second event appears after publishing, it indicates that a URL has been generated and assigned to the piece of content.
Why?
The actions of publishing (making content publicly available) and circulating (assigning sites and sections) are distinct in Arc XP. This allows for flexible workflows around who, how, and when content makes it to the public web.
Because of this separation, URL generation is also a separate action that occurs after a piece of content is both published and circulated. Since publishing and circulating can happen independently, URL generation is not explicitly tied to publishing, hence a separate event that provides this information.
If you follow a “standard” flow in Composer, publish and URL generation events should occur back-to-back on your Kinesis stream. However, non-standard API workflows could mean that publish events happen well before URL generation events.
What does this mean for the `created` field?
As stated earlier, `created = true` will show when a piece of content first appears in Content API. For a document whose URL needs to be generated, you will typically observe this flow of events:
1. Story is published, URL is empty, “created” = true
2. Story is published a 2nd time, URL has been generated, “created” = false (since the insert in Content API occurred in the 1st event)
So `created` alone cannot be used to determine whether a document is “first published,” in the sense that most consumers will want to wait for the document to obtain a URL before using it.
Resources
Arc Kinesis Consumer Example (Java)