七三笔记 (Qisan Notes)

Installing ES with analysis (tokenizer) plugins

 
Downloads and release notes by version
https://www.elastic.co/guide/en/elasticsearch/reference/6.5/es-release-notes.html

[Reference for this installation]
http://blog.51cto.com/moerjinrong/2310817

An analysis plugin must have exactly the same version number as ES, so check the plugin version against the ES version before installing.

This installation uses v6.5.0 of ES and IK, plus elasticsearch-head (v5.0.0 archive):
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.0.tar.gz
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.5.0/elasticsearch-analysis-ik-6.5.0.zip
https://github.com/mobz/elasticsearch-head/archive/v5.0.0.tar.gz
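Since the plugin version has to equal the ES version, it can help to derive every download URL from a single version string. The sketch below only follows the URL patterns listed above; the function names are made up for illustration and nothing is downloaded.

```python
import re

# URL prefixes copied from the links above.
ES_BASE = "https://artifacts.elastic.co/downloads/elasticsearch"
IK_BASE = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def es_tarball_url(version: str) -> str:
    """URL of the ES tarball, following the 6.x naming pattern used above."""
    return f"{ES_BASE}/elasticsearch-{version}.tar.gz"


def ik_plugin_url(version: str) -> str:
    """URL of the IK plugin zip built for the same version."""
    return f"{IK_BASE}/v{version}/elasticsearch-analysis-ik-{version}.zip"


def artifact_version(filename: str) -> str:
    """Extract the x.y.z version embedded in an artifact file name."""
    match = re.search(r"(\d+\.\d+\.\d+)", filename)
    if match is None:
        raise ValueError(f"no version found in {filename!r}")
    return match.group(1)


def versions_match(es_file: str, plugin_file: str) -> bool:
    """True when both artifacts carry exactly the same version."""
    return artifact_version(es_file) == artifact_version(plugin_file)
```

For example, `versions_match("elasticsearch-6.5.0.tar.gz", "elasticsearch-analysis-ik-6.5.0.zip")` is true, while pairing a 7.17.20 tarball with a 7.17.4 plugin zip is flagged as a mismatch.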


Official documentation
https://www.elastic.co/guide/index.html

https://www.elastic.co/guide/en/elasticsearch/reference/6.5/release-notes-6.5.0.html


ES7 
Downloads: https://elastic.co/downloads/elasticsearch
Release notes: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/release-notes-7.17.20.html

Past releases
https://www.elastic.co/downloads/past-releases#elasticsearch

https://www.elastic.co/downloads/past-releases/elasticsearch-7-17-20

https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.17.20-linux-x86_64.tar.gz

Download

 
Requires a JDK (ES 6.5 needs Java 8 or later; the ES 7 tarballs bundle their own JDK)
    
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.0.tar.gz --no-check-certificate

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.5.0/elasticsearch-analysis-ik-6.5.0.zip --no-check-certificate

wget https://github.com/mobz/elasticsearch-head/archive/v5.0.0.tar.gz --no-check-certificate


wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.17.20-linux-x86_64.tar.gz --no-check-certificate

 
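Three-node cluster installation
- ES runs as a cluster. A single node can form a cluster on its own, but with discovery.zen.minimum_master_nodes: 2 as configured below, at least two nodes must be running before a master is elected
- SSH does not need to be configured
- Requires a JDK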

Create a Docker subnet that is only used inside this server
docker network rm  mydk
docker network create --subnet=192.168.73.0/24 mydk

 
docker run -itd --privileged --name es1 -h es1 --net mydk --ip 192.168.73.11 -v /opt:/opt -v /tmp:/tmp -v /mnt:/mnt -v /media:/media -p 13301:13301 cent7  /usr/sbin/init

docker exec -it es1 bash

### Install dependencies
yum install -y net-tools libaio numactl
yum -y install gcc gcc-c++ autoconf make
yum install openssl-devel bzip2-devel


# es2 does not publish -p 13301:13301: host port 13301 is already bound by es1
docker run -itd --privileged --name es2 -h es2 --net mydk --ip 192.168.73.12 -v /opt:/opt -v /tmp:/tmp -v /mnt:/mnt -v /media:/media cent7  /usr/sbin/init

docker exec -it es2 bash


docker run -itd --privileged --name es3 -h es3 --net mydk --ip 192.168.73.13 -v /opt:/opt -v /tmp:/tmp -v /mnt:/mnt -v /media:/media cent7  /usr/sbin/init

docker exec -it es3 bash

JDK

 
export JAVA_HOME=/opt/app/jdk-11
export CLASSPATH=.:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar:$CLASSPATH
export PATH=$JAVA_HOME/bin:$PATH

Node 1

 
mkdir -p /data/es/{app,data,logs}
rsync -rltDv /media/xt/tpf/soft/es/ /data/es/app/

cd /data/es/app/
tar -zxvf elasticsearch-6.5.0.tar.gz

mkdir /data/es/app/elasticsearch-6.5.0/plugins/ik
unzip elasticsearch-analysis-ik-6.5.0.zip -d /data/es/app/elasticsearch-6.5.0/plugins/ik

ls /data/es/app/elasticsearch-6.5.0/plugins/ik
commons-codec-1.9.jar    config                               httpclient-4.5.2.jar  plugin-descriptor.properties
commons-logging-1.2.jar  elasticsearch-analysis-ik-6.5.0.jar  httpcore-4.4.4.jar    plugin-security.policy
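The plugin-descriptor.properties file listed above records which ES version the plugin was built for, so the version requirement can be checked after unpacking. A minimal sketch, assuming the descriptor uses plain key=value lines with an `elasticsearch.version` key; the sample text is hand-written for illustration and not copied from the real file.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def plugin_matches_es(descriptor_text: str, es_version: str) -> bool:
    """True when the plugin was built for exactly this ES version."""
    props = parse_properties(descriptor_text)
    return props.get("elasticsearch.version") == es_version


# Illustrative descriptor content (hypothetical sample).
sample = """
# Elasticsearch plugin descriptor file
name=analysis-ik
version=6.5.0
elasticsearch.version=6.5.0
"""
```

In practice the text would be read from /data/es/app/elasticsearch-6.5.0/plugins/ik/plugin-descriptor.properties before the node is started.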

 
echo "
xt soft nofile 655350
xt hard nofile 655350
xt soft nproc 655350
xt hard nproc 655350
xt soft memlock -1
xt hard memlock -1
" >>  /etc/security/limits.conf 

cat  /etc/security/limits.conf 


ll /etc/security/limits.d/20-nproc.conf

echo "
xt      soft    nproc     655350
">> /etc/security/limits.d/20-nproc.conf
cat /etc/security/limits.d/20-nproc.conf

echo "
xt      soft    nproc     655350
">> /etc/security/limits.d/90-nproc.conf
cat /etc/security/limits.d/90-nproc.conf


echo "
vm.max_map_count=262144
">> /etc/sysctl.conf
sysctl -p
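The settings appended above can be sanity-checked before starting ES. The sketch below parses the text of sysctl.conf and limits.conf and reports what was configured for a user; the helper names are made up for illustration, and the 262144 threshold is simply the value written above.

```python
REQUIRED_MAX_MAP_COUNT = 262144  # the value written to /etc/sysctl.conf above


def sysctl_value(conf_text: str, key: str):
    """Return the last integer assigned to `key`, or None if it is absent."""
    value = None
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, raw = (part.strip() for part in line.split("=", 1))
        if name == key:
            value = int(raw)
    return value


def max_map_count_ok(conf_text: str) -> bool:
    """True when vm.max_map_count is set and large enough."""
    value = sysctl_value(conf_text, "vm.max_map_count")
    return value is not None and value >= REQUIRED_MAX_MAP_COUNT


def limits_for(conf_text: str, user: str) -> dict:
    """Collect limits.conf entries for `user` as {(type, item): value}."""
    entries = {}
    for line in conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) == 4 and fields[0] == user:
            entries[(fields[1], fields[2])] = fields[3]
    return entries
```

A call such as `max_map_count_ok(open("/etc/sysctl.conf").read())` checks only the file contents; `sysctl -p` still has to be run for the value to take effect.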

In this Docker-based installation, SSH between the containers failed but ES communication worked; the cause is unknown.

 
Run on every node
yum install openssh-server

adduser xt 
su - xt 
ssh-keygen -t rsa   

On each node, run the two commands that target the other two nodes
ssh-copy-id -i ~/.ssh/id_rsa.pub  192.168.73.11
ssh-copy-id -i ~/.ssh/id_rsa.pub  192.168.73.12
ssh-copy-id -i ~/.ssh/id_rsa.pub  192.168.73.13

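If all three nodes are installed on one machine, SSH trust does not need to be configured.

Likely explanation

 
ES nodes talk to each other over the transport port (9300 by default) and serve clients over the HTTP port (9200); SSH is not involved.
SSH trust is therefore unnecessary, and ES communication works even though SSH between the containers fails.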

 
Configure one node, then copy it to the other nodes
vim /data/es/app/elasticsearch-6.5.0/config/elasticsearch.yml 

cluster.name: my-application
node.name: node-1

path.data: /data/es/data/
path.logs: /data/es/logs/

network.host: 192.168.73.11
http.port: 9200
discovery.zen.ping.unicast.hosts: ["192.168.73.11", "192.168.73.12","192.168.73.13"]
discovery.zen.minimum_master_nodes: 2

 
Copy the files to the other nodes
mkdir -p /data/es/{app,data,logs}
scp -r xt@192.168.73.11:/data/es/app/elasticsearch-6.5.0 /data/es/app

chown -R xt.xt /data/es

On the other nodes, change elasticsearch.yml as follows
vim /data/es/app/elasticsearch-6.5.0/config/elasticsearch.yml 
node.name: node-2
network.host: 192.168.73.12
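Because only node.name and network.host differ between nodes, the per-node files can also be generated rather than edited by hand. A rough sketch using the addresses and paths from this note; `render_config` and `quorum` are illustrative names, and the quorum is the usual majority formula for minimum_master_nodes.

```python
HOSTS = ["192.168.73.11", "192.168.73.12", "192.168.73.13"]


def quorum(master_eligible: int) -> int:
    """Majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible // 2 + 1


def render_config(index: int, hosts=HOSTS, cluster="my-application") -> str:
    """Render elasticsearch.yml (ES 6 style) for the node at 0-based `index`."""
    host_list = ", ".join(f'"{h}"' for h in hosts)
    lines = [
        f"cluster.name: {cluster}",
        f"node.name: node-{index + 1}",
        "path.data: /data/es/data/",
        "path.logs: /data/es/logs/",
        f"network.host: {hosts[index]}",
        "http.port: 9200",
        f"discovery.zen.ping.unicast.hosts: [{host_list}]",
        f"discovery.zen.minimum_master_nodes: {quorum(len(hosts))}",
    ]
    return "\n".join(lines) + "\n"
```

`render_config(1)` yields the node-2 file with network.host 192.168.73.12, and with three hosts the computed minimum_master_nodes is 2, matching the value set by hand above.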

 
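# Alternative to scp: /tmp is bind-mounted into every container (-v /tmp:/tmp),
# so the directory can be handed over through it.
# On node 1:
rsync -rltDv /data/es/app/elasticsearch-6.5.0 /tmp/

# On each of the other nodes:
mkdir -p /data/es/{app,data,logs}
rsync -rltDv /tmp/elasticsearch-6.5.0 /data/es/app/
chown -R xt.xt /data/es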

 
http://192.168.73.11:9100

[Start in the background]
cd /data/es/app/elasticsearch-6.5.0
nohup ./bin/elasticsearch > /data/es/logs/start.log 2>&1 &
tailf /data/es/logs/start.log
or
./bin/elasticsearch -d

When the first node starts it logs the message below; it stops once the second node is up
not enough master nodes discovered during 

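After the second node starts, a message about the cluster forming is logged; the third node produces no such message, because minimum_master_nodes is 2 in this configuration and the cluster has already formed by then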
[node-2] recovered [0] indices into cluster_state

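[Shutdown]
Kill the process as the user that started it:
ps -ef |grep ela
kill <PID>       # SIGTERM lets ES shut down cleanly; fall back to kill -9 <PID> only if it does not exit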

 
Open in a browser
http://192.168.73.11:9200/
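The same check can be scripted instead of done in a browser. The sketch below uses only the standard library; `cluster_ready` is an illustrative helper, and the sample dictionary is hand-written in the shape of a `_cluster/health` response rather than captured from this cluster.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body (needs a reachable node)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def cluster_ready(health: dict, expected_nodes: int) -> bool:
    """True when all expected nodes have joined and the status is not red."""
    return (
        health.get("number_of_nodes") == expected_nodes
        and health.get("status") in ("green", "yellow")
    )


# Hand-written sample (assumption), not real output from this cluster.
sample_health = {
    "cluster_name": "my-application",
    "status": "green",
    "number_of_nodes": 3,
}
```

Against the running cluster this would be called as `cluster_ready(fetch_json("http://192.168.73.11:9200/_cluster/health"), 3)`.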


Create an index
curl -XPUT http://192.168.73.11:9200/index
{"acknowledged":true,"shards_acknowledged":true,"index":"index"}


Create a mapping
curl -XPOST http://192.168.73.11:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d'
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word"
            }
        }

}'

{"acknowledged":true}


Index some documents
curl -XPOST http://192.168.73.11:9200/index/fulltext/1 -H 'Content-Type:application/json' -d'
{"content":"时间是一切财富中最宝贵的财富"}
'
curl -XPOST http://192.168.73.11:9200/index/fulltext/2 -H 'Content-Type:application/json' -d'
{"content":"世界上一成不变的东西，只有“任何事物都是在不断变化的”这条真理。"}
'

curl -XPOST http://192.168.73.11:9200/index/fulltext/3 -H 'Content-Type:application/json' -d'
{"content":"要使别人喜欢你，首先你得改变对人的态度，把精神放得轻松一点，表情自然，笑容可掬，这样别人就会对你产生喜爱的感觉了。——卡耐基"}
'

curl -XPOST http://192.168.73.11:9200/index/fulltext/4 -H 'Content-Type:application/json' -d'
{"content":"君子在下位则多谤，在上位则多誉；小人在下位则多誉，在上位则多谤。——柳宗元"}
'

curl -XPOST http://192.168.73.11:9200/index/fulltext/5 -H 'Content-Type:application/json' -d'
{"content":"一个不注意小事情的人，永远不会成功大事业。——卡耐基"}
'
{"_index":"index","_type":"fulltext","_id":"5","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":0,"_primary_term":3}
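Instead of one request per document, the same documents could be sent in a single `_bulk` request, whose body is newline-delimited JSON: an action line followed by a source line for each document. A small sketch that only builds the payload; `bulk_body` is an illustrative name, and the `_type` field in the action line applies to ES 6 as used here.

```python
import json


def bulk_body(index: str, doc_type: str, docs: dict) -> str:
    """Build an NDJSON _bulk payload: one action line and one source line per doc."""
    lines = []
    for doc_id, content in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # ensure_ascii=False keeps the Chinese text readable in the payload
        lines.append(json.dumps({"content": content}, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


docs = {
    "1": "时间是一切财富中最宝贵的财富",
    "5": "一个不注意小事情的人，永远不会成功大事业。——卡耐基",
}
payload = bulk_body("index", "fulltext", docs)
```

The payload would then be POSTed to http://192.168.73.11:9200/_bulk with the header Content-Type: application/x-ndjson.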


Search
curl -XPOST http://192.168.73.11:9200/index/fulltext/_search?pretty  -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "卡耐基" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'
Query result
{
  "took" : 307,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "fulltext",
        "_id" : "5",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "一个不注意小事情的人，永远不会成功大事业。——卡耐基"
        },
        "highlight" : {
          "content" : [
            "——卡耐基"
          ]
        }
      },
      {
        "_index" : "index",
        "_type" : "fulltext",
        "_id" : "3",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "要使别人喜欢你，首先你得改变对人的态度，把精神放得轻松一点，表情自然，笑容可掬，这样别人就会对你产生喜爱的感觉了。——卡耐基"
        },
        "highlight" : {
          "content" : [
            "——卡耐基"
          ]
        }
      }
    ]
  }
}
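When the response is consumed by a program, usually only the ids, scores and highlight fragments matter. The sketch below extracts them from a response dictionary; the helper names are illustrative, and `total_hits` also accepts the object form of `hits.total` that ES 7 returns.

```python
def total_hits(response: dict) -> int:
    """ES 6 returns an integer total; ES 7 returns {"value": ..., "relation": ...}."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def summarize_hits(response: dict, field: str = "content") -> list:
    """Return (id, score, highlight fragments) for each hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field, [])
        rows.append((hit["_id"], hit["_score"], fragments))
    return rows


# Trimmed copy of the response shown above.
sample = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "5", "_score": 0.2876821, "highlight": {"content": ["——卡耐基"]}},
            {"_id": "3", "_score": 0.2876821, "highlight": {"content": ["——卡耐基"]}},
        ],
    }
}
```

With the sample, `summarize_hits(sample)` lists document 5 first and document 3 second, in the order ES returned them.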

Installing ES 7 with analysis plugins

 
ES 7 can be installed as a single node

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.17.4-linux-x86_64.tar.gz --no-check-certificate

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.17.4/elasticsearch-analysis-ik-7.17.4.zip --no-check-certificate
   
wget https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.17.4/elasticsearch-analysis-pinyin-7.17.4.zip

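Further reading:

Elasticsearch (ES for short) is a widely used open-source search engine: https://www.elastic.co/

Plenty of material on installing and deploying ES is available online, for example: https://juejin.cn/post/7104875268166123528

For more detail on classic information-retrieval techniques, see: https://nlp.stanford.edu/IR-book/information-retrieval-book.html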

 
Install on the es1, es2 and es3 containers created earlier
    
su - xt
cd /data/es/app
rsync -rltDv /tmp/es7/elasticsearch-7.17.4-linux-x86_64.tar.gz ./
tar -xvf elasticsearch-7.17.4-linux-x86_64.tar.gz

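discovery.seed_hosts: the list of cluster hosts to contact for discovery
cluster.initial_master_nodes: the master-eligible nodes that take part in the first master election when the cluster starts for the first time; required in production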


vim elasticsearch-7.17.4/config/elasticsearch.yml
cluster.name: my-application
node.name: node-1
path.data: /data/es/data/
path.logs: /data/es/logs
network.host: 192.168.73.11
http.port: 9200
discovery.seed_hosts: ["192.168.73.11", "192.168.73.12"]
cluster.initial_master_nodes: ["node-1", "node-2"]


./elasticsearch-7.17.4/bin/elasticsearch -d

https://www.cnblogs.com/Likfees/p/16449224.html

Analyzers (tokenizers)

 
cd elasticsearch-7.17.4/plugins/
rsync -rltDv /tmp/es7/elasticsearch-analysis-ik-7.17.4.zip ./
unzip -d ik elasticsearch-analysis-ik-7.17.4.zip
rm elasticsearch-analysis-ik-7.17.4.zip

 

wget https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.17.4/elasticsearch-analysis-pinyin-7.17.4.zip

rsync -rltDv /tmp/es7/elasticsearch-analysis-pinyin-7.17.4.zip ./
unzip -d pinyin elasticsearch-analysis-pinyin-7.17.4.zip

rm elasticsearch-analysis-pinyin-7.17.4.zip

 
rsync -rltDv /data/es/app/elasticsearch-7.17.4 /tmp/

rsync -rltDv /tmp/elasticsearch-7.17.4 /data/es/app/

 
vim elasticsearch-7.17.4/config/elasticsearch.yml
cluster.name: my-application
node.name: node-2
path.data: /data/es/data/
path.logs: /data/es/logs
network.host: 192.168.73.12
http.port: 9200
discovery.seed_hosts: ["192.168.73.11", "192.168.73.12"]
cluster.initial_master_nodes: ["node-1", "node-2"]

./elasticsearch-7.17.4/bin/elasticsearch -d

 
[xt@es2 elasticsearch-7.17.4]$ netstat -tunlp           
(Not all processes could be identified, non-owned process info
  will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 192.168.73.12:9300      0.0.0.0:*               LISTEN      213/java            
tcp        0      0 192.168.73.12:9200      0.0.0.0:*               LISTEN      213/java            
tcp        0      0 127.0.0.11:43867        0.0.0.0:*               LISTEN      -                   
udp        0      0 127.0.0.11:47013        0.0.0.0:*                           -                   
[xt@es2 elasticsearch-7.17.4]$

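The configuration below was not used this time, and the cluster still came up. discovery.zen.ping.unicast.hosts and discovery.zen.minimum_master_nodes are ES 6 (Zen discovery) settings; in ES 7 they are replaced by discovery.seed_hosts and cluster.initial_master_nodes, so they are no longer needed.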

 
cluster.name: kkb-es
node.name: node-0  
node.master: true
network.host: 0.0.0.0
http.port: 9200
transport.tcp.port: 9300 # transport (TCP) port
discovery.zen.ping.unicast.hosts: ["192.168.147.66:9300","192.168.147.67:9300","192.168.147.68:9300"]
discovery.zen.minimum_master_nodes: 2
http.cors.enabled: true
http.cors.allow-origin: "*"
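The relationship between the two generations of discovery settings can be written down as a small translation step. This is only a sketch of the mapping described above, working on a plain dictionary of settings; `migrate_discovery` is an illustrative name, and the list of initial master node names has to be supplied by hand because ES 6 has no counterpart for it.

```python
def migrate_discovery(es6: dict, initial_masters: list) -> dict:
    """Translate ES 6 Zen discovery keys into their ES 7 counterparts."""
    # Keep everything that is not a discovery.zen.* setting.
    es7 = {
        key: value
        for key, value in es6.items()
        if not key.startswith("discovery.zen.")
    }
    hosts = es6.get("discovery.zen.ping.unicast.hosts")
    if hosts is not None:
        es7["discovery.seed_hosts"] = list(hosts)
    # discovery.zen.minimum_master_nodes is dropped: ES 7 manages the voting
    # quorum itself and only needs the bootstrap list on first start.
    es7["cluster.initial_master_nodes"] = list(initial_masters)
    return es7


es6_settings = {
    "cluster.name": "kkb-es",
    "node.name": "node-0",
    "discovery.zen.ping.unicast.hosts": ["192.168.147.66:9300", "192.168.147.67:9300"],
    "discovery.zen.minimum_master_nodes": 2,
}
es7_settings = migrate_discovery(es6_settings, ["node-0"])
```

The result keeps cluster.name and node.name, carries the host list over as discovery.seed_hosts, and contains no discovery.zen.* keys.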

For later installations, simply unpack the archive below and edit the configuration file

 
tar -zcvf elasticsearch-7.17.4_ok.tar.gz elasticsearch-7.17.4/
mv elasticsearch-7.17.4_ok.tar.gz /media/xt/tpf/soft/es7/

JDK

 
export JAVA_HOME=/opt/app/jdk-11
export CLASSPATH=.:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar:$CLASSPATH
export PATH=$JAVA_HOME/bin:$PATH

System configuration

Unpack and install

 
mkdir /data/es 
cd /data/es 
rsync -rltDv /media/xt/tpf/soft/es7/elasticsearch-7.17.4_ok.tar.gz ./ 
tar -xvf elasticsearch-7.17.4_ok.tar.gz

Configuration

 
mkdir -p /data/es/data/
mkdir -p /data/es/logs/
    
Single-node configuration
vim config/elasticsearch.yml
network.host: 127.0.0.1
discovery.seed_hosts: ["127.0.0.1"]
cluster.initial_master_nodes: ["node-1"]

Start

 
./bin/elasticsearch -d

cat config/elasticsearch.yml

 
cluster.name: my-application
node.name: node-1
path.data: /data/es/data/
path.logs: /data/es/logs
network.host: 127.0.0.1
http.port: 9200
transport.tcp.port: 9300
discovery.seed_hosts: ["127.0.0.1"]
cluster.initial_master_nodes: ["node-1"]

 
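If network.host is set to 127.0.0.1, ES can only be reached from the local machine.

To allow access from outside, network.host must be set to an address that is reachable from outside.
- For example, with Ubuntu running inside Windows,
- reaching the ES instance in Ubuntu from Windows requires network.host to be the externally visible IP, e.g. 172.31.150.83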
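The effect of the network.host value can be checked quickly with the standard `ipaddress` module. A minimal sketch; `reachability` and its three labels are made up for this note and are not ES terminology.

```python
import ipaddress


def reachability(network_host: str) -> str:
    """Describe who can reach a node bound to `network_host`."""
    address = ipaddress.ip_address(network_host)
    if address.is_loopback:
        return "local only"        # e.g. 127.0.0.1
    if address.is_unspecified:
        return "all interfaces"    # e.g. 0.0.0.0
    return "this interface only"   # a concrete address such as 172.31.150.83
```

`reachability("127.0.0.1")` reports a local-only binding, while the 172.31.150.83 example above is classified as bound to one concrete interface.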

Connecting to ES from Python

 
pip install elasticsearch6

 
Create an index
curl -XPUT http://192.168.73.11:9200/index


Create a mapping
curl -XPOST http://192.168.73.11:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d'
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word"
            }
        }

}'


Index some documents
curl -XPOST http://192.168.73.11:9200/index/fulltext/1 -H 'Content-Type:application/json' -d'
{"content":"时间是一切财富中最宝贵的财富"}
'
curl -XPOST http://192.168.73.11:9200/index/fulltext/2 -H 'Content-Type:application/json' -d'
{"content":"世界上一成不变的东西，只有“任何事物都是在不断变化的”这条真理。"}
'

curl -XPOST http://192.168.73.11:9200/index/fulltext/3 -H 'Content-Type:application/json' -d'
{"content":"要使别人喜欢你，首先你得改变对人的态度，把精神放得轻松一点，表情自然，笑容可掬，这样别人就会对你产生喜爱的感觉了。——卡耐基"}
'

Searching from Python

 
from elasticsearch6 import Elasticsearch

es = Elasticsearch('http://192.168.73.11:9200')

# Index name
index_name = 'index'

# Run a simple search request
response = es.search(
    index=index_name,
    body={
        "query": {
            "match_all": {}
        }
    }
)

# Print the search hits
print(response['hits']['hits'])

# Close the connection to Elasticsearch
# es.close()

Create an index and insert a document

 

from elasticsearch6 import Elasticsearch
import datetime

# Initialize the Elasticsearch client
es = Elasticsearch([{'host': '192.168.73.11', 'port': 9200}])

# Create the index if it does not exist yet
index_name = "index2"
if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name)

# Insert a document
doc_id = "2"
doc_body = {"name": "张三", "age": 30, "email": "aaazhnag@example.com", "created_at": datetime.datetime.utcnow()}
response = es.index(index=index_name, id=doc_id, body=doc_body, doc_type="_doc")

# Print the response
print(response)

 
pip install elasticsearch7

 
from elasticsearch7 import Elasticsearch, helpers

# NOTE: to_keywords() is the helper defined in the full script further below.

# 1. Create the Elasticsearch connection
es = Elasticsearch(
    hosts=['http://127.0.0.1:9200'],  # service address and port
    http_auth=("elastic", "aaa"),  # user name, password
)

# 2. Define the index name
index_name = "index"

# 3. Delete the index if it already exists (demo only; not needed in real use)
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)

# 4. Create the index
es.indices.create(index=index_name)

# 5. Bulk-load actions
actions = [
    {
        "_index": index_name,
        "_source": {
            "keywords": to_keywords(para),
            "text": para
        }
    }
    for para in [
        "今天天气不错",]
]

# 6. Load the text into the index
helpers.bulk(es, actions)

 
# Connection with authentication over HTTPS (elasticsearch-py 8.x style).
# The client has no `Requirements` class and no `verify_certificates`
# argument; the certificate switch is called `verify_certs`.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    'https://localhost:9200',
    basic_auth=('user', 'passwd'),
    verify_certs=False,  # set to False to skip SSL certificate verification (testing only)
)

 
from elasticsearch7 import Elasticsearch, helpers
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
import re

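import warnings
warnings.simplefilter("ignore")  # suppress some of the ES client warnings

# word_tokenize and stopwords need these NLTK data packages (downloaded once)
nltk.download('punkt')      # newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('stopwords')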


def to_keywords(input_string):
    '''Keep only the keywords of an (English) text.'''
    # Replace every character that is not a letter, digit or whitespace with a space
    no_symbols = re.sub(r'[^a-zA-Z0-9\s]', ' ', input_string)
    word_tokens = word_tokenize(no_symbols)
    # Load the stop-word list
    stop_words = set(stopwords.words('english'))
    ps = PorterStemmer()
    # Drop stop words and reduce the remaining words to their stems
    filtered_sentence = [ps.stem(w)
                         for w in word_tokens if w.lower() not in stop_words]
    return ' '.join(filtered_sentence)
    

# 1. Create the Elasticsearch connection
es = Elasticsearch(
    hosts=['http://192.168.73.11:9200'],  # service address and port
    verify_certs=False,  # only relevant for https:// endpoints
    # http_auth=("elastic", "aaa"),  # user name, password
)

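# 2. Define the index name
index_name = "index"

# 3. Delete the index if it already exists (demo only; not needed in real use)
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)

# 4. Create the index
es.indices.create(index=index_name)

# 5. Bulk-load actions
# to_keywords() keeps only ASCII letters and digits, so this Chinese sample
# produces an empty "keywords" field; only "text" carries the content.
actions = [
    {
        "_index": index_name,
        "_source": {
            "keywords": to_keywords(para),
            "text": para
        }
    }
    for para in [
        "今天天气不错",]
]

# 6. Load the text into the index
helpers.bulk(es, actions)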
