Elasticsearch and FSCrawler with ownCloud


If you have ever wanted a search engine for your own files, now you can have it, thanks to a few open source projects. This project will use Elasticsearch and FSCrawler to provide the search function and the file indexing. FSCrawler uses Tesseract for text recognition (OCR).

For the front end, I was initially thinking of making my own PHP/HTML, but that would become very complicated once it had to deal with file permissions (to have any application-layer security). If you just have internally used files that are shared among everyone on the LAN, then a simple search interface would work great. If you have multiple users with personal files, then ownCloud is the way to go. ownCloud is a ready-made platform that has all of the application security implemented, as well as a search function (and a lot of other great features). I modified ownCloud to use the Elasticsearch index for searches. This adds text recognition for images, and massively speeds up searches (a search takes less time than one on your favorite commercial search engine).

I will add that what I have done is very simple, it could use improvements, and I wouldn't recommend it in a large environment, because something will probably not work and people will be upset at you. For professional environments, you can purchase the enterprise version of ownCloud and get their improved search functionality.

To do this, just follow the steps below, assuming you have Ubuntu.

Install ownCloud using docker-compose

Install Docker:
sudo apt install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt update
sudo apt install docker-ce
service docker start
Install Docker Compose:
sudo curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
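A quick version check confirms both installed correctly:
docker --version
docker-compose --version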
  1. Create a new project directory
    mkdir owncloud-docker-server
    cd owncloud-docker-server

  2. Copy docker-compose.yml from the GitHub repository
    wget https://raw.githubusercontent.com/owncloud-docker/server/master/docker-compose.yml
  3. Create the environment configuration file
    cat << EOF > .env
    OWNCLOUD_VERSION=10.0
    OWNCLOUD_DOMAIN=localhost
    ADMIN_USERNAME=admin
    ADMIN_PASSWORD=admin
    HTTP_PORT=8000
    EOF
  4. Build and start the container
    docker-compose up -d
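To check that the containers came up, list them and hit ownCloud's status endpoint (status.php ships with ownCloud and returns a small JSON blob):
docker-compose ps
curl http://localhost:8000/status.php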

If everything worked, ownCloud will be running on port 8000. Now we will move on to installing Elasticsearch and FSCrawler in order to index and search all of your files.

First, install Java; you will need to agree to the license.
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo nano /etc/environment

Add a line to that file that looks like this:
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
source /etc/environment
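Before moving on, you can confirm Java and JAVA_HOME are set correctly:
java -version
echo $JAVA_HOME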
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-6.x.list
sudo apt-get update && sudo apt-get install elasticsearch
sudo service elasticsearch start
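Give it a few seconds to start, then confirm Elasticsearch is answering on its default port; it should return a small JSON document with the cluster name and version:
curl http://localhost:9200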
Install Tesseract for OCR.
sudo add-apt-repository universe
sudo apt-get install tesseract-ocr
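You can verify the Tesseract install and see which OCR languages are available; the ocr.language value in the FSCrawler settings later on needs to match one of these:
tesseract --version
tesseract --list-langs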
Install FSCrawler by downloading a snapshot zip:
wget https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es6/2.7-SNAPSHOT/fscrawler-es6-2.7-20190302.182900-24.zip
unzip fscrawler-es6-2.7-20190302.182900-24.zip
cd fscrawler-es6-2.7-SNAPSHOT/
Run FSCrawler once so it creates a settings file for the job (it will ask whether to create one; answer Y), then open the settings:
bin/fscrawler indexname
nano /home/user/.fscrawler/indexname/_settings.yaml
Edit that file to look like the following, modifying fs.url to the location where all the files are stored:
---
name: "indexname"
fs:
  url: "/mnt/big/files/indexname/files/"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  pdf_ocr: true
  ocr:
    language: "eng"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
rest:
  url: "http://127.0.0.1:8080/fscrawler"
Then run FSCrawler again to start indexing:
bin/fscrawler indexname
Check that it is adding documents to Elasticsearch:
curl http://localhost:9200/_cat/indices
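You can also try the same kind of URI search the modified ownCloud code will use, to confirm file content (including OCR'd text) comes back; the search term here is just an example:
curl "http://localhost:9200/indexname/_search?q=test&size=5&pretty"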
Now modify the ownCloud search function to use Elasticsearch. If you ran docker-compose up in the ownCloud installation section, it will already be running. To modify it, we need bash/shell access to the container. To do this, type:
docker ps
If you only have ownCloud running on Docker, there should be 3 containers. We will be changing the ownCloud container. To do so, copy the container ID of the container using the "owncloud/server" image, and type the following with your own container ID:
docker exec -it dfbf7fa1b53d /bin/bash
That should put you on the container as root, at /var/www/owncloud. We will modify the code for the search function to have it query Elasticsearch rather than the original ownCloud mechanism, and then map the results into the format that ownCloud is looking for. The way I have done it is quite simple, so it removes some of the ownCloud search result functionality: I only return one type of result, files.

I should also have a blurb about security. The whole point of me using ownCloud for this is that it controls access to the files. However, Elasticsearch will now have a copy of all the file content in it. Access to this can be restricted with user accounts, and/or by only having Elasticsearch listen on the loopback interface and the docker interface for ownCloud (a sketch of that configuration follows the code below). If you used the docker-compose script above, the ownCloud address should be 172.18.0.4, and the interface on the host machine should be 172.18.0.1, so you can tell Elasticsearch to listen on that address. Then each user can have their own index/FSCrawler job, and ownCloud can query a different index depending on who is logged in.

In summary, this code will get search results from Elasticsearch, using the username as the index name, and map those to the format ownCloud uses. Edit /var/www/owncloud/lib/private/Search/Provider/File.php and make it look like this:
<?php
/**
 * @author Andrew Brown <andrew@casabrown.com>
 * @author Bart Visscher <bartv@thisnet.nl>
 * @author Jakob Sack <mail@jakobsack.de>
 * @author Jörn Friedrich Dreyer <jfd@butonic.de>
 * @author Morris Jobke <hey@morrisjobke.de>
 * @author Thomas Müller <thomas.mueller@tmit.eu>
 *
 * @copyright Copyright (c) 2018, ownCloud GmbH
 * @license AGPL-3.0
 *
 * This code is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License, version 3,
 * as published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License, version 3,
 * along with this program. If not, see <http://www.gnu.org/licenses/>
 *
 */

namespace OC\Search\Provider;
use OC\Files\Filesystem;

/**
 * Provide search results from the 'files' app
 */
class File extends \OCP\Search\Provider {

	/**
	 * Search for files and folders matching the given query
	 * @param string $query
	 * @return \OCP\Search\Result
	 */
	public function search($query) {
		//$files = Filesystem::search($query);
		// Use the logged-in user's name as the Elasticsearch index name,
		// so each user only searches their own index
		$sess = \OC::$server->getUserSession();
		$user = $sess->getUser();
		$userName = $user->getDisplayName();
		$results = [];
		// urlencode() the query so multi-word searches still form a valid URL
		$json = file_get_contents("http://172.18.0.1:9200/".$userName."/_search?q=".urlencode($query)."&size=200");
		$lov = json_decode($json, true);
		// Create a new array for the results that fits with the ownCloud array
		foreach ($lov["hits"]["hits"] as $each) {
			$fileData = array();
			$fileData["type"] = "file";
			// Strip the on-disk data directory prefix from the real path
			$path = explode("/mnt/big/files/".$userName, $each["_source"]["path"]["real"])[1];
			$fileData["path"] = $path;
			$fileData["size"] = $each["_source"]["file"]["filesize"];
			// Modification time, as indexed by FSCrawler
			$fileData["modified"] = $each["_source"]["file"]["last_modified"];
			$fileData["mime_type"] = $each["_source"]["meta"]["content_type"];
			$fileData["permissions"] = 27;
			$fileData["id"] = '';
			$fileData["name"] = $each["_source"]["file"]["filename"];
			$path = substr($path, 6); //Strip the /files off of the beginning of the filename
			$fileData["link"] = "/remote.php/webdav".$path;
			$fileData["mime"] = $each["_source"]["file"]["content_type"];
			$results[] = $fileData;
		}
		return $results;
	}
}
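As mentioned above, you can lock this down by having Elasticsearch listen only on the loopback interface and the docker interface. Here is a minimal sketch, assuming the 172.18.0.1 docker bridge address from the compose setup above (check yours with ip addr). Edit the Elasticsearch config:
sudo nano /etc/elasticsearch/elasticsearch.yml
Add a line like this, then restart Elasticsearch:
network.host: ["127.0.0.1", "172.18.0.1"]
sudo service elasticsearch restart
Note that once Elasticsearch binds to a non-loopback address it runs its production bootstrap checks at startup, so you may need to fix anything those flag (such as vm.max_map_count).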
There you have it. It's quite complicated to set up, but once it's running, it works great! The few improvements I would like to make would be to run FSCrawler as a service (see the sketch below), and to automate the addition of FSCrawler instances as users are added. Also, improving the search results display, to regain the original ownCloud functionality that displayed different items for media, would be nice.
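For running FSCrawler as a service, a systemd template unit would be one way to do it. Below is a minimal sketch; the /opt/fscrawler path and the unit name are my own choices, not anything FSCrawler ships with, and it assumes the job settings already exist under ~/.fscrawler for the user the service runs as:
sudo tee /etc/systemd/system/fscrawler@.service << 'EOF'
[Unit]
Description=FSCrawler job %i
After=network.target elasticsearch.service

[Service]
ExecStart=/opt/fscrawler/bin/fscrawler %i
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now fscrawler@indexname
Since it is a template unit, bringing up a crawler for a new user is just another systemctl enable --now fscrawler@username, which covers part of the automation mentioned above.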
