Factors Impacting Content Migration

Whether you are implementing a new system or moving from one system to another, content migration is always an important aspect. Each situation is unique, and so each migration scenario will have its own roadmap. However, there are some common factors present in every migration that can determine how long the migration will last. I'm listing a few of them below. If you think there are other factors as well, please feel free to comment.

In order to take stock of these factors, one needs to follow a good migration approach and spend a decent amount of time in analysis. I will not go into the details of such an approach – there are quite a number of good articles on these approaches, and in particular I like this one by James Robertson. This post is also not about the importance of content analysis, governance and other best practices 🙂

So here are some factors. At this point, these are only high-level thoughts and I will probably expand on them as and when I get time. Feedback is most welcome.

Source(s) of Content

Where and how the original content lives is probably the most important factor defining the migration. It could be in a flat file system, a database, another content management system, another application or somewhere else. It is important to understand whether it is stored in a proprietary format and how easy it is to access. Obviously, content stored in a table in a relational database is easier to access than something stored in a proprietary format.

Type of Content

Content type could be media (images, video, audio), documents (DOC, XLS, PDF), text (HTML, XML), database records or something else. Migration timelines depend hugely on this – migrating X-ray images, where each file could be a couple of MBs or more, poses different challenges than migrating small text files. And when done across multiple environments, the effort only multiplies.

Closely related to this is what the content actually contains. For example, do you need to migrate multiple languages, encodings and character sets?
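
As a small illustration of the character set point, here is a minimal sketch that re-encodes a legacy file to UTF-8 before it is handed to the destination system; the source encoding and the file paths are illustrative assumptions.

    # Minimal sketch: re-encode a legacy file to UTF-8 before ingestion.
    # The source encoding ("cp1252") and the paths are illustrative assumptions.
    def reencode_to_utf8(src_path, dest_path, src_encoding="cp1252"):
        with open(src_path, "r", encoding=src_encoding) as src:
            text = src.read()
        with open(dest_path, "w", encoding="utf-8") as dest:
            dest.write(text)

    # Hypothetical usage:
    # reencode_to_utf8("legacy/article-0042.html", "staging/article-0042.html")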

Quality of content

Pranshu gave me this example on Twitter the other day. He was involved in a migration scenario in which the headlines of news articles were actually images. So even though the migration of the body and other fields could be automated, a good amount of manual intervention was required to convert image headlines to text headlines for the destination. Some other examples could be:

  • Do all HTML files follow a template?
  • Content with inconsistent metadata (like Male/Female vs. M/F; see the sketch after this list)
  • Content with missing metadata that could be mandatory on the destination system
  • How much of the content is still relevant?
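
To make the inconsistent-metadata point concrete, here is a minimal sketch of the kind of normalization step a migration script typically needs; the field and the mapping table are illustrative assumptions, not taken from any particular CMS.

    # Minimal sketch: normalize inconsistent metadata values before loading
    # them into the destination system. The mapping table is an assumption.
    GENDER_MAP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

    def normalize_gender(value):
        if value is None:
            # Missing metadata: flag for manual review if it is mandatory downstream.
            return None
        return GENDER_MAP.get(value.strip().lower(), value)

    assert normalize_gender("M") == "Male"
    assert normalize_gender("Female") == "Female"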

Amount of Content

The amount of content as well as the size of the individual files is very relevant. In document scenarios it matters because huge files take time to move across, and in web content scenarios it matters especially when things need manual intervention.

Destination System

Is the target system a content management system or is it something else? Does it support the fields that you require or do you need workarounds? Does it provide some mechanism to ingest content? Does it support your metadata and permissioning requirements?

Transformation or Value Add required

In my opinion, this is probably the most important factor. The amount of transformation required between source and destination actually defines how much automation is possible and how much manual intervention is required. If you were doing an "as is" migration, things would possibly be trivial. Some examples:

  • The Title field in the source needs to be mapped to the Headline field in the destination
  • Do all source fields need to be migrated?
  • Is there a need to define additional fields?
  • Is there a need to transform fields based on constraints? (For example, an ID in the source CMS is stored as "123-45-6789" whereas the new CMS does not permit "-" and it needs to be stored as "123.45.6789"; see the sketch after this list.)
  • Data cleansing
  • Do you need other value adds (like SEO, copywriting and so on)?
  • Do you need to repurpose the same content for, say, delivery to mobile devices?
  • Are there links between files that need to be preserved (like an XLS embedded within a DOC)?
  • Do you want to migrate only the latest version or all versions? What happens to content that is part of an incomplete workflow?
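
To make the field mapping and the constraint example above concrete, here is a minimal sketch; the field names, the mapping table and the ID rule are illustrative assumptions, not taken from any specific product.

    # Minimal sketch of a field mapping plus constraint-driven transformation.
    # All field names and rules here are illustrative assumptions.
    FIELD_MAP = {"title": "headline", "body": "body", "record_id": "record_id"}

    def transform(source_item):
        # Only mapped fields are migrated; anything else is dropped (or flagged for review).
        dest = {FIELD_MAP[k]: v for k, v in source_item.items() if k in FIELD_MAP}
        # Destination does not allow "-" in IDs: "123-45-6789" -> "123.45.6789"
        if "record_id" in dest:
            dest["record_id"] = dest["record_id"].replace("-", ".")
        return dest

    print(transform({"title": "Q3 results", "record_id": "123-45-6789", "author": "x"}))
    # -> {'headline': 'Q3 results', 'record_id': '123.45.6789'}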

Users and Roles

The difference in how users, roles and the whole permissioning model work in the source as compared to the destination also plays an important role. This depends on the capabilities of the systems as well as on how comprehensively your organization has defined these. In some cases, just like data mapping, you might also need to map permissions: a Read permission in the source could be mapped to a View permission in the destination. There will also be cases where there is no one-to-one mapping of permissions between source and destination.
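
As a minimal sketch of such a mapping, with an explicit fallback for permissions that have no one-to-one equivalent (all permission names here are assumptions):

    # Minimal sketch: map source permissions to destination permissions.
    # Permission names are illustrative; real systems will differ.
    PERMISSION_MAP = {"read": "view", "write": "edit", "delete": "delete"}

    def map_permission(source_perm):
        mapped = PERMISSION_MAP.get(source_perm)
        if mapped is None:
            # No one-to-one mapping: flag for a manual decision instead of guessing.
            print(f"No destination equivalent for '{source_perm}', flagging for review")
        return mapped

    map_permission("read")      # -> "view"
    map_permission("annotate")  # flagged for review, returns None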

Amount of automation possible

Based on some of the factors above, you will have an idea of how much of the migration can be automated. The extent of automation depends on the source as well as the destination system:

  • Does the source allow export of content?
  • Does the destination allow import of content?
  • Are third-party products for analysis and migration being used?
  • Do these products allow ETL-like activities?
  • And so on.

Roll out

How you want to roll out the new system also impacts your timelines. In scenarios where multiple geographies or multiple business units are involved, it could be tricky. The reasons for this are more organizational than technological. So whether you do a big-bang roll out or a phased roll out impacts the migration process.

Parallel Run

This is in some way related to the point above. Will the source and destination systems be required to run in parallel? If yes, content will possibly have to reside in both places, and if users continue to modify content during the migration, you have to consider doing multiple iterations.

Infrastructure and Connectivity

The speed at which content can be moved across, exported, imported or ingested also depends on the connectivity between the source, the destination, the databases and so on.
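
A back-of-envelope calculation helps set expectations here; both figures below are purely illustrative assumptions, and the estimate ignores API overhead, retries, transformation and indexing, which usually dominate in practice.

    # Rough, illustrative estimate of raw transfer time (assumed figures).
    content_size_gb = 500    # assumed total content size
    effective_mbps = 100     # assumed sustained throughput in megabits per second

    seconds = (content_size_gb * 8 * 1024) / effective_mbps
    print(f"~{seconds / 3600:.1f} hours of raw transfer time")  # ~11.4 hours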

So do you have similar experiences? Are there any other factors that can impact migration time lines?

(Thanks to @pranshuj, @rikang and @lokeshpant for inputs)


Random Notes on EMC World

These are some observations, in no particular order. I will possibly post some “more sensible” posts on specific topics later.

  • It was my first time at EMC World, and I thought the focus was much more on storage and infrastructure than on content management. They certainly did much better, though, in terms of integrating CMA (Content Management and Archival) with the overall EMC World. A lot of people I talked to thought it was actually much better than in the past, when CMA folks felt quite out of place.
  • A big theme at the conference was building social communities. Joe Tucci, the EMC Chairman, started his keynote with some statistics on tweets about EMC World. He spoke about how EMC is working to give its customers more choice, better control and improved efficiency. There was a dedicated bloggers' lounge, set up by Len Devanna and his team, which provided a great informal environment for bloggers and tweeps to come together and socialize. I am glad I was able to meet Laurence (pie), Len and Stu. There were other lounges along similar lines; in particular, the Momentum lounge provided a good place for Documentum users to meet.
  • Then there was CMA president Mark Lewis' keynote. He talked of ROI as return on information.
  • I was particularly interested in EMC's initiatives around Customer Communication Management (or rather around their xPression product, which came via the acquisition of Document Sciences). Although there were a few (and good) sessions on this, I was hoping for a bigger presence. They had a small, not very prominent booth within the large EMC booth.
  • Another interesting announcement (although it was made a couple of days before EMC World) was the free availability of the developer edition of Documentum. I think this is a great move to increase usage and acceptance of Documentum. EMC claims it takes 23 minutes to get up and running with Documentum, although I suspect it will take much longer just to download it – it is almost a 2 GB download and has steep RAM requirements (4 GB recommended, although 3 GB would work too), so it would not be as easy to run on a laptop as some other products. This will essentially enable developers to get their hands dirty, which in turn will help spread Documentum further. The developer edition comes bundled with JBoss and the SQL Server Express database.
  • Some claimed that there were 7,000 attendees, but I felt the number was lower. I also think the number of customers, especially those interested in content management, was far lower than at previous events. Although there were quite a few partners, the big partners were conspicuous by their absence.
  • CMIS was reasonably well covered. There was a dedicated session by Laurence and Karin Ondricek, and Victor Spivak also covered it in his session on the D6.5 architecture. Laurence demoed the federated CMIS sample application and, according to him, the fact that Alfresco and Nuxeo allowed their servers to be up for a Documentum conference showed the high degree of cooperation happening around CMIS.
  • Victor was quite clear about the scope of CMIS and, more importantly, what it is not. According to him, "I" is the most important letter in the acronym; in that sense, the objective is to provide interoperability, not to implement more sophisticated features. So the focus is only on basic services and mashup-type applications, not real business applications, which are best handled by proprietary APIs (like DFS) or CMS-specific features. He also said that if you were to describe the 6.5 release in one sentence, it would be "high volume services".
  • There were quite a few sessions on WCM and the more "delivery oriented" aspects like dynamic delivery, site management, Web 2.0, RIAs and so on. EMC has also latched on to the term Web Experience Management (WEM), something that Vignette and Fatwire have been using for some time. Web Publisher is not yet a particularly sophisticated WCM platform, and it remains to be seen how they do it.
  • Most of the sessions were EMC-specific and delivered by EMC, and I think the number of independent sessions should be increased. I attended the one by Jeetu Patel of Doculabs, in which he talked about different types of ROI modeling for ECM projects.
  • There were quite a few sessions on CenterStage. Victor talked about the philosophy behind CenterStage, which was to separate the front end completely from the business logic and backend, because front-end technologies have been changing quite often. I think this is an obvious approach and wonder why it was not done in Webtop. He also explained the increasing support for RESTful APIs and so on. (See Pie's post here.)
  • There were also a few discussions around Lucene replacing FAST search in EMC's products.

Open Text acquires Vignette

After Autonomy/Interwoven and Oracle/Sun news, here comes the third big news of the year.

If Unilever can have multiple soaps and GM can have multiple car models, why can't a Content Management vendor have multiple products? OT's acquisition of Vignette points to this increasing "commoditization" of the Content Management marketplace.

There may be a lot of overlap in the products across OT and Vignette, but we all know that one size does not fit all, so why not have different products for different scenarios, different price points, different technology stacks and different requirements? OT now has multiple options for Document Management, DAM, WCM and so on, plus a bonus portal server that they lacked before. They had a portal integration kit (PIK) that exposed LiveLink's functionality as portlets that could be deployed on some of the portal servers (but not VAP and Sun, as far as I know).

There’s some good analysis here and here.

On a side note, I think people who worked closely with Vignette saw it coming. A colleague of mine told me this:

One Singapore-based Vignette customer we were talking to suddenly went quiet, and our sales guy spotted him meeting OpenText. Another one we were talking to suddenly decided not to continue with Vignette and to migrate to Day Communiqué instead. A senior person in Vignette Singapore joined OpenText about 2-3 months back – and was not replaced. There were many other signs in the way Vignette was handling people and partnerships that showed something was on.

I always considered Interwoven, Vignette and Fatwire (Open Market, Divine and FutureTense before that) the leaders and pioneers in the pure-play Web Content space. With Interwoven and Vignette gone, what does this mean for the WCM marketplace? The end of an era?

Content Repositories – Coexistence, Migration and Consolidation

Many of our customers have more than one content repository. So we often get into situations where there is a need for:

  1. Coexistence: Business requires these multiple repositories to exist simultaneously. This could be because there are different applications for different requirements, or because the migration effort is so huge that it is not possible to retire one system immediately. So there is a need for a common interface so that business users can access all the repositories without knowing they are different. They should be able to check out from one, check in to another and generally work on multiple repositories as if there were a single backend system.
  2. Migration: For multiple reasons (licensing, satisfaction and so on), there is a requirement to move content from one system to another. Or perhaps to deploy content from a content repository to a delivery channel.
  3. Consolidation: To save costs (licensing, training, infrastructure), they want to consolidate to a smaller number of repositories.

Now obviously, in any of these scenarios, it becomes very important to do content inventory, content analysis and mapping, taxonomy assessment and so on. However, when the content size is huge (some of our customers have terabytes of content, often in the form of huge documents), it becomes important to automate the migration process to the extent possible. What this essentially means is that you need an intermediate layer that can talk to the source and target repositories and move content across. Depending on whether you want coexistence or migration, the intermediate layer needs to be two-way (read and write) or just one-way (import or export). I can think of three ways to achieve this:

  1. Roll your own: This possibly provides the most flexibility but needs the most time to develop. You essentially write your own code that exports content from the source, does transformation and cleansing, and then finally imports it into the target repository. Most decent content management systems provide APIs that can be used in conjunction with your code to achieve this; a rough sketch of such a loop follows this list.
  2. Use connectors/features provided by CMS vendors: Many CMS vendors provide some mechanism for importing and exporting. They might even provide some way of importing content from “specific” systems.
  3. Use third party tools
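
As item 1 above notes, the "roll your own" route boils down to an export-transform-import loop with careful error handling. Here is a minimal sketch; the repository objects and their methods are placeholders of my own, not any vendor's API. In practice you would use the source and target systems' own APIs (DFS, Alfresco web services, JDBC and so on).

    # Rough sketch of a "roll your own" export-transform-import loop.
    # The repository objects and their methods are placeholders, not a real API.
    def migrate(source_repo, target_repo, transform):
        failures = []
        for item in source_repo.list_items():                 # export
            try:
                doc = source_repo.export(item)
                target_repo.import_item(transform(doc))       # transform + import
            except Exception as exc:
                # Exception handling and reporting matter as much as the happy
                # path: failed items need to be retried or handled manually.
                failures.append((item, str(exc)))
        return failures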

I have been doing some research and have come across these vendors who can help you automate this process to a great extent:

EntropySoft

EntropySoft provides an amazing collection of two-way connectors for 30 or so different repositories. These connectors can read from and write to the repositories. EntropySoft essentially provides two mechanisms:

A Content Federation Server, which is a web application. It allows you to configure these repositories and exposes the functionality via a web interface. Using this interface you can access the repositories, and your business users will not know that they are different repositories. As an example, you can check out a policy document from Documentum and check it in to FileNet. The same interface also lets you migrate content from one repository to another and create tasks that will automatically migrate content as and when a document is updated in one. In the screenshot below, you can see Livelink, Alfresco, FileNet and some other repositories.

[Screenshot: EntropySoft Web Interface]
[Screenshot: Configure new repo]

Now this interface is simple in the sense that you can only do an as-is migration. For more complex migrations, where you have to transform content, map metadata from source to destination, or map permissions, users and roles, EntropySoft provides an ETL product, which is an Eclipse-based environment. Using this ETL, you can create complex migration processes using drag and drop.

[Screenshot: ETL]

EntropySoft also works with many search engines for creating federated search applications.

The best thing about EntropySoft is its ease of setup. You can actually get up and running and start an as-is migration (meaning no transformation, no mapping) in just about 15-20 minutes (about 7-10 minutes each for setting up the source and the destination). Where I think they lag is possibly in connectors for more web content management systems.

OpenMigrate

OpenMigrate is an open source alternative from TSG. Currently they have adaptors for Documentum, Alfresco, JDBC and the file system. I believe they are probably working on SharePoint and FileNet connectors as well.

Vamosa

Vamosa is also a good alternative. To me it appears that their strength lies in web content management. Here's a list of their connectors. I think Vamosa's differentiation is that they not only focus on connectors but look at migration holistically. So they have some good products that help you with all the steps I mentioned above – content inventory, analysis and so on – and then the migration itself.

Many people have said that something like CMIS (if and when it becomes a standard) will have an adverse impact on the connector industry. I actually think it will be good for these connector vendors, because they would be able to use CMIS instead of relying on the proprietary APIs of each repository. Plus, I think connecting to a repository is only one aspect, albeit an important one. There is a lot more that goes along with it – transformation, the ability to map source data to the target repository, reporting, exception management and so on – and that is where such products add a lot of value.

Do you know of any other products in this space? What do you think of these?

Autonomy Acquires Interwoven

It was a usual hectic day at work when I read about this sudden and interesting development: Interwoven is to be acquired by Autonomy. You can read more about it at CMS Watch and CMS Wire.

I felt a bit sad – Interwoven was one of the few pure-play CMS vendors and pioneered many of the Content Management concepts. Okay, so the products will still be there, but you never know how they will evolve in the context of a new setup. A lot of attention is now on Vignette, the other major CMS vendor. I wonder why no one is talking about Fatwire?

Lee Dallas calls this a consolidation in a different direction. Most other consolidations in this space have been with infrastructure vendors or with other related vendors. So in that sense, this brings in a unique differentiation for both these vendors. What could be interesting in this context is what now happens to Autonomy's relationships with other CMS vendors. Many CMS vendors had integrations and OEM relationships with Autonomy, and those will probably get redefined now. Similarly, Interwoven's partnerships with other search vendors (like FAST) will probably also get reviewed.

Even though Autonomy is known more for its search products, it also has offerings for BPM (Cardiff), Records Management (Meridio) and Digital Assets (Virage). So it would be interesting to see how and when the overlaps with Interwoven's MediaBin, WorkSite and other related offerings are rationalized.

In other interesting news this week, Alfresco released the final version of Alfresco 3 Labs, which among other things has Web Studio, a designer tool to build web applications. But that is a topic for another post.

Goodbye 2008, Welcome 2009

Okay, so another year comes to an end, and while we welcome the new year, here's a look at some of the themes (in random order) of the year gone by that might have an impact on content technologies next year.

Verticalized Applications

Content Management Systems as horizontal solutions have been around for a long time, and most known vendors provide similar features. The industry, however, is asking for more domain-specific solutions built on standard CMS repositories. Based on this demand, and the fact that this provides differentiation for CMS vendors, I hope to see more and more domain- or vertical-specific solutions such as Loan Origination, Claims Processing and similar accelerators from many CMS vendors. Also, with the slowdown in the economy, it is easier to sell a domain solution than a pure horizontal solution.

Portal and Content Consolidation

Many enterprises struggle with a multitude of applications with overlapping functionality. Organizations have multiple CMS repositories and many portals. This often leads to duplication of content, varied user experience and huge costs. Because of huge cost pressures, many organizations have been considering consolidation of their content applications.

This will lead to the following benefits:

  • Reduced Hardware Infrastructure as you don’t need those 5 different ECM repositories
  • Reduced employee costs as you do not need skilled people across 5 different portal servers
  • Standardized processes and hence increased productivity
  • Reduced employee training costs
  • Unified User Experience
  • Reduced Integration, Maintenance and Support Costs

I believe this could be a very important way to reduce and control costs as well as bring in some standardization. So many organizations will start focused initiatives to consolidate their existing applications.

Open Source

Open Source Content Management and Portal solutions have matured quite a bit. Because of this, and the fact that there is cost pressure on everyone, enterprises that would not even consider Open Source solutions earlier are now more favorably disposed towards them. They are becoming open to experimenting with technologies that are generally not considered *enterprisey*. Many of the open source products are being tracked by the waves and quadrants of major analysts, and that reflects a huge change. This is also good for the Open Source vendors, because many enterprises use these analysts' reports for shortlisting. Many open source products have also released commercial versions, and that is another reason these vendors are getting a foothold within enterprises that did not want to use them earlier, citing a lack of support options.

Another factor that encourages the use of Open Source products is that people want to quickly build "informal" applications, which many commercial products cannot do well. There are many popular Open Source (and free) products that do certain things much better.

Although the initial cost could come down by using Open Source, organizations should carefully look at the impact over a longer horizon and should consider Open Source as another alternative in the marketplace. They should select Open Source based on overall fit with their requirements and not just make a decision based on initial licensing cost.

Web 2.0

Widgets and Gadgets have been popular for quite some time. Some products had gadgets well before the portlet spec. I am sure many people have seen examples of counters, ad banners and so on, which are essentially widgets. However, there is now considerable interest in using these within enterprises for more sophisticated, portal-like applications.

Currently, most social networking is horizontal – you become a member of a social network, I become one, and we write scraps on each other. What next? I believe Vertical Social Networking is becoming popular. Some areas where we already see this, or where there is potential, are Jobs, Real Estate and Classifieds. After all, it is easier to buy an old laptop from a contact's contact than from an unknown person who has advertised in the classifieds.

In order to reduce costs, many enterprises, especially those that have to provide product support, want to leverage communities for customer support. They want people to help each other and come to their support only as a last resort. What this means is increasing use of tools that enable collaboration – wikis, for example. Many enterprises are using these communities not just for support but also as a way to generate revenue.

Some organizations are also using Web 2.0 as a means of Knowledge Management. Instead of regular process-oriented KM, which forces people to contribute, they want to use mechanisms that encourage people, so that people in turn want to contribute. This is a huge shift – people don't like contributing when they are forced to, but are likely to contribute if they enjoy doing it. This also means a shift from "control and process" to "informality and accessibility".

In spite of all this, I think how to use Web 2.0 within the enterprise is still not very clear to many organizations, and there is huge scope for improvement. One of the reasons people cite is that the workforce is used to applications that became successful on the consumer Internet and wants the same kind of experience from enterprise applications, but organizations need to be very careful here. Here's a nice post by Vilas.

Alternate Delivery Models

There is more acceptance of SaaS-based offerings. This is especially true for applications that are not mission critical to the business. Businesses are experimenting with SaaS-based providers because this reduces their dependence on internal IT, apart from other benefits like faster time to market, no capital expenditure, low risk and so on. Along with this, alternate pricing models are also being looked at. Some examples are pay per document, pay per loan, pay per claim and so on.

Standards

The portlet spec 2.0, or JSR 286, was released. Although the portlet standards (JSR 286 and JSR 168) have been relatively successful in terms of adoption and support, the content repository standard, JSR 170, has not been that popular. Meanwhile, vendors are collaborating on technologies that will help customers reuse existing investments. As an example, many vendors have come together on CMIS. Okay, this is not a standard yet, but it is possibly headed in that direction. A standard like this is very much needed, and hopefully CMIS will achieve what JSR 170/283 did not.

I would also hope that a standard emerges for Gadgets/Widgets.

Site Management and Personalization

Traditionally, Content Management was decoupled from Site Management. However, marketing and business people now want more control, and there is increasing convergence of Content Management and Site Management. This essentially means better user experience and rich, dynamic sites. It also means features like personalization are making a comeback. This has also been helped by cheap bandwidth and better client-side technologies.

Document Services

Document Composition and Generation is becoming part of mainstream ECM. There have been a few partnerships as well as mergers in this space. Related terms in this space are Document Output Management and Forms Management.

This was probably the last post of this year. Thanks for reading the blog and here’s wishing you a great year ahead.

CMIS – Yet another acronym or more than that?

Content Management Interoperability Services (CMIS) is a new standard that (from the spec)

… will define a domain model and set of bindings, such as Web Service and REST/Atom that can be used by applications to work with one or more Content Management repositories/systems.

This spec will soon be submitted to OASIS. It has participation from IBM, EMC, Microsoft, Open Text, Oracle, SAP and the open source vendor Alfresco.

Around the time JSR 170 was released, I had written that many products have proprietary repositories and that it might not be trivial for them to re-architect those to be JCR compliant. This seems to be an important consideration in this spec, and thus CMIS is designed as an abstraction over existing systems. So it does not require products to make any major changes to their architecture. It does not even try to make it mandatory to expose ALL features via CMIS.

There is also a recognition of the fact that many organizations indeed have multiple ECM systems and it is going to remain like that. So it might not be possible for everyone to consider migration and/or consolidation to a common repository.

Above all, it has support from Microsoft. And with a focus on REST, HTTP and Atom, it has that distinct feel of Web 2.0, content mashups and so on.
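
As a rough illustration of that REST/Atom flavor, here is a minimal sketch that fetches a repository's AtomPub service document and lists its workspaces; the endpoint URL is an assumption, and the actual URL (and any authentication) varies by vendor and by draft of the specification.

    # Minimal sketch: read a CMIS AtomPub service document and list workspaces.
    # The endpoint URL below is an illustrative assumption.
    import urllib.request
    import xml.etree.ElementTree as ET

    url = "http://localhost:8080/cmis/atom"  # assumed endpoint
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)

    ns = {"app": "http://www.w3.org/2007/app", "atom": "http://www.w3.org/2005/Atom"}
    for workspace in tree.findall("app:workspace", ns):
        title = workspace.find("atom:title", ns)
        print(title.text if title is not None else "(unnamed workspace)")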

So what does it mean for JCR? I'd like to believe what Kas Thomas has written on CMS Watch based on his interaction with David Nuescheler. In fact, the first ever draft implementation of CMIS is based on a JCR repository (Alfresco)! However, buyers of new ECM systems will now be less enthusiastic about the "supports JSR 170" tick mark in their RFPs, and that will mean reduced pressure on product vendors to support the JCR standard.

There is also something I'm trying to figure out, and I'm hoping the experts can point me to something. All the diagrams, including the one here, show how this spec aims to improve interoperability among different ECM systems by having a single application that can access any CMS. However, doesn't interoperability also mean interaction between the participating CMSs themselves? For example, if CMIS-enabled Documentum and FileNet are involved and I check out a document in Documentum, will FileNet users also see that document as checked out? Or does this use case not make sense? We have seen a lot of scenarios where a customer has multiple ECM systems and they want this ability via a common interface.