First steps with enterprise search

Since my first day at TBSCG I have been working with search engines. In this article I would like to share some of what I have learned about this (at least for me 😛) exciting area of IT.

Data, data, data… big one

I don’t think there is anyone in the IT world who has not heard about Big Data yet, so I am going to use it to introduce you to this world.

As we know, there is a lot of data out there: data created by publishers, companies and users, plus plenty of machine-generated logs, traces and so on. And it keeps growing every day. How do we manage it? Solving this problem is what Big Data is about, but the concept on its own is still too broad to be useful. Today, though, there are three well-defined, growing areas that exist only because of the Big Data problem: Business Intelligence, Cloud Systems (including those huge NoSQL databases) and, the case at hand, Enterprise Search. Keep in mind that it is still a niche market, but one with great expectations of future growth.

A search engine is a software product made to index, or organize, information. It allows you not only to find a document (such as a web page, a Word document or a row in a database) but also to extract information from it, such as entities (language concepts, e.g. names, addresses, dates), sentiments (positive or negative, at least) or anything else that data mining and data scientists can come up with. Of course, text is not the only thing that can be indexed. Remember: we are talking about data! Images, videos and audio streams can also be indexed, and then found, analyzed and connected to other sources.
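To make "indexing" a bit more concrete, here is a minimal sketch of an inverted index, the classic structure that maps each word to the documents containing it. It is only a toy illustration under my own assumptions (the `TinyIndex` class and the document ids are made up), not a description of how IDOL or any real product works internally.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each lowercase word to the ids of the documents containing it.
class TinyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Split the text into words and record that each word occurs in docId.
    public void add(String docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of the documents containing the word (empty set if there are none).
    public Set<String> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("page-1", "Enterprise search with IDOL");
        index.add("page-2", "Search engines index data");
        System.out.println(index.search("search"));
    }
}
```

Finding a document is then a map lookup instead of a scan over every text, which is the whole point of building an index first.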

In the enterprise world we want to do the same, but with the data of a specific company. When a system like this is implemented, the intention is to do what Google does with the WWW. But, of course, we don’t have Google’s budget. Besides, let’s be fair: the web is much easier to index and to structure, because it has standards for both document structure and link definition, and that makes life much easier indeed.

There are several commercial and open-source solutions available. Google has its own, of course, but there are also HP, IBM, Microsoft, Oracle and many others. On the open-source side the best known, almost the standard, option is Apache Solr, based on the Lucene engine. A newer and fancier alternative is ElasticSearch.

You should know something: there is no big secret about what a search engine can and cannot do (as I said, that is more a problem for mathematicians and data scientists). All of the alternatives are quite similar and do mostly the same things, with small differences. What separates good search results from bad ones is not the product itself, but the on-site implementation and how well it is adapted to the company’s structure and needs.

Here at TBSCG we work with the HP Autonomy solution, called IDOL (Intelligent Data Operating Layer). Even though you can’t have a working copy at home, there is a much nicer way to try it without any installation: in the cloud 🙂

IDOL OnDemand

Now let’s get started and do something useful! Go to https://www.idolondemand.com/ and sign up for an account to try the free version of IDOL.

Once you are registered, go to https://www.idolondemand.com/developer/apis . This will give you a general idea of what you can do with IDOL. The most important concepts are Connectors and Indexes.

As a concept, a Connector is a software module that “connects” to a source and “sends” the information from that source to an Index. Connectors are needed to retrieve information from unstructured places, which is why every kind of information has a different connector. Indexes, on the other hand, are the containers for that information, and there are different types of indexes too.

It is possible to connect many connectors to one index, or each connector to a different index. Queries are then sent to a specific index or to all of them. It all depends on your needs.

I have drawn a simple example schema where 3 connectors are connected to 2 indexes that receive 2 types of queries:

[Image: schema]
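The same schema can be sketched as a tiny in-memory model: three connectors feeding two indexes, with one query aimed at a single index and another at all of them. Everything here (the `SearchTopology` class, the connector and index names, the naive substring matching) is a hypothetical illustration of the concept, not IDOL code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model: connectors push documents into indexes, queries read from one index or from all.
class SearchTopology {
    // index name -> documents stored in that index
    private final Map<String, List<String>> indexes = new LinkedHashMap<>();
    // connector name -> name of the index it feeds
    private final Map<String, String> connectors = new HashMap<>();

    public void createIndex(String index) {
        indexes.putIfAbsent(index, new ArrayList<>());
    }

    public void createConnector(String connector, String index) {
        if (!indexes.containsKey(index)) {
            throw new IllegalArgumentException("Unknown index: " + index);
        }
        connectors.put(connector, index);
    }

    // A connector "sends" a document from its source to its destination index.
    public void ingest(String connector, String document) {
        String index = connectors.get(connector);
        if (index == null) {
            throw new IllegalArgumentException("Unknown connector: " + connector);
        }
        indexes.get(index).add(document);
    }

    // Query type 1: search a single, specific index (case-insensitive substring match).
    public List<String> query(String index, String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : indexes.getOrDefault(index, Collections.emptyList())) {
            if (doc.toLowerCase().contains(term.toLowerCase())) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Query type 2: search every index and merge the hits.
    public List<String> queryAll(String term) {
        List<String> hits = new ArrayList<>();
        for (String index : indexes.keySet()) {
            hits.addAll(query(index, term));
        }
        return hits;
    }
}
```

For instance, two web connectors could both feed a `web_index` while a file system connector feeds a `files_index`; `query("web_index", "cms")` then only sees web pages, while `queryAll("cms")` also returns matching files.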

Currently, IDOL OnDemand supports two types of connectors: web, for indexing web pages, and file system, for indexing files. There is only one type of index: text.

If you look at any IDOL API, you will notice that the connection is made through a simple REST interface. That means you can connect any application, written in any language, just by making HTTP calls, which is great 😀

They have plenty of tutorials on the community site, but I am going to show a small synchronous client written in Java with the Unirest library that can be used for anything. You can download the full code here.

First, set your API key at the top of the file. Get yours from https://www.idolondemand.com/account/api-keys.html
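To show how little is needed, here is a rough sketch of such a call using only the JDK, with no extra libraries. Please treat the details as assumptions: the base URL and the `apikey` parameter name are what I believe the service expects, but they are not stated in this article, so check them against the API documentation. The `RestSketch` class itself is made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class RestSketch {
    // ASSUMPTION: base URL for synchronous calls; verify it in the API documentation.
    static final String BASE = "https://api.idolondemand.com/1/api/sync/";

    // Build the full request URL for an API path such as "querytextindex/v1?text=cms".
    // ASSUMPTION: the key travels in a query parameter named "apikey".
    public static String buildUrl(String pathAndQuery, String apikey) {
        String separator = pathAndQuery.contains("?") ? "&" : "?";
        return BASE + pathAndQuery + separator + "apikey=" + apikey;
    }

    // Perform an HTTP GET and return the response body as a string.
    public static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }
}
```

A call would then look like `RestSketch.get(RestSketch.buildUrl("querytextindex/v1?text=cms", myKey))`, where `myKey` holds your own API key. Any language with an HTTP client can do the same.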

private final String apikey = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXXX"; // replace with your own key

Now let’s look at the most important functions, the ones for creating an index and a connector:

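public String createIndex(String name, String description)
{
 // Creates a text index of the "standard" flavor.
 // The values are appended as-is, so they must already be URL-safe.
 return runQueryStringGET("createtextindex/v1?index=" + name + "&flavor=standard&description=" + description);
}

public String createWebConnector(String name, String url, String index, String description)
{
 // config and destination are URL-encoded JSON values:
 //   config      = {"url":"http://<url>"}
 //   destination = {"action":"addtotextindex","index":"<index>"}
 // The description is sent wrapped in double quotes (%22).
 return runQueryStringGET("createconnector/v1?connector=" + name
 + "&type=web&config=%7B%22url%22%3A%22http%3A%2F%2F" + url + "%22%7D&"
 + "destination=%7B%22action%22%3A%22addtotextindex%22,%22index%22%3A%22" + index + "%22%7D&"
 + "description=%22" + description + "%22");
}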
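The hand-encoded `config` and `destination` values above are hard to read, and any name or description containing a space or a symbol would break the query string. One possible alternative, sketched below, is to write the JSON in plain form and let `java.net.URLEncoder` do the escaping. The `ConnectorQuery` class and its method names are my own invention for this example, and unlike the function above it does not wrap the description in quotes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ConnectorQuery {
    // Percent-encode one query-string value as UTF-8 (spaces become '+').
    public static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen in practice: every JVM is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    // Build the path and query string for creating a web connector.
    // The JSON is assembled by simple concatenation, so host and index
    // must not contain double quotes or backslashes.
    public static String webConnectorQuery(String name, String host, String index, String description) {
        String config = "{\"url\":\"http://" + host + "\"}";
        String destination = "{\"action\":\"addtotextindex\",\"index\":\"" + index + "\"}";
        return "createconnector/v1?connector=" + encode(name)
                + "&type=web&config=" + encode(config)
                + "&destination=" + encode(destination)
                + "&description=" + encode(description);
    }
}
```

The resulting string can be passed to `runQueryStringGET` just like before. The same `encode` helper is also handy for multi-word search queries such as "content management".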

And, to start doing something with it, a simple query:

public String queryText(String query){
 // The query is appended as-is, so it must already be URL-safe (e.g. a single word)
 return runQueryStringGET("querytextindex/v1?text=" + query);
}

Calling the client is easy:

IDOLOnDemandClientSync cl1 = new IDOLOnDemandClientSync();
cl1.createIndex("tbscg_web", "");
cl1.createWebConnector("tbscg_connector", "www.tbscg.com", "tbscg_web", "");
cl1.queryText("cms");

Remember that the number of connectors and indexes you can have is limited. You can check your quota usage at https://www.idolondemand.com/account/quotas.html

In future articles I will go deeper into IDOL OnDemand configuration and features. HP is constantly adding new ones and improving the platform.

Here you can find more examples for other programming languages:
https://community.idolondemand.com/t5/Wiki/tkb-p/tkb_idol

And here more tutorials: