Elasticsearch 既然作为一个全文检索引擎,那么自然需要将数据导入,让 Elasticsearch 去索引。

Elasticsearch(后简写为 ES) 的基本单元是文档,使用 JSON 来描述。

有很多方法可以把数据导入到 ES:

  • RESTful 接口
  • Bulk API 批量导入
  • elasticsearch-dump
  • Logstash 将收集的数据导入

Prerequisite

导入数据前要了解的知识。

  • Cluster,集群,通常由多个节点组成 ES 集群
  • Index,通常称为索引,文档的属性
  • Document,文档,JSON 格式定义的数据
  • Shard,分片,索引会水平分片
  • Replicas,ES 允许用户创建索引和分片的 Replicas

RESTful 接口导入

如果数据文件比较简单,只有单层 JSON 结构,并且小于 1MB,可以使用 POST 请求直接将数据提交到 ES。

假设有数据 accounts.json

{"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}

然后可以使用如下命令导入:

curl -u admin:my_elastic_pass -H 'Content-Type: application/json' -XPOST 'http://localhost:9200/accounts/_doc/_bulk?pretty' --data-binary @accounts.json

返回:

{
  "_index" : "accounts",
  "_id" : "_bulk",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

可以通过如下命令来查看所有索引的情况:

curl "localhost:9200/_cat/indices?v"

然后访问 Kibana,在后台就可以查询导入的内容:

kibana

Bulk

这里使用官方的样例数据:

wget https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip
unzip accounts.zip
curl -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary @accounts.json

需要注意的是如果使用 Bulk 批量导入,那么格式需要按照:

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

或者可以在 Kibana 后台,点击侧边栏 Dev Tools,然后在编辑框中输入:

POST bank/_bulk
{"index":{"_id":"1"}}
{"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
{"index":{"_id":"6"}}
{"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"hattiebond@netagy.com","city":"Dante","state":"TN"}
省略...

最后点击执行。

kibana devtool bulk import

执行成功后,可以运行 curl localhost:9200/bank/_search,如果返回值中 value 是导入的条数就表示成功了。

elasticsearch-dump

Logstash

Logstash 是开源的服务器端数据处理管道,能够同时从多个来源采集数据、转换数据,然后将数据发送到 Elasticsearch 中。Logstash 的官方文档请参见:https://www.elastic.co/guide/en/logstash/current/getting-started-with-logstash.html

reference

  • <https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html