GitHub - lx0612/Data-pipeline-using-python-kafka-and-mongodb-: Data-pipeline-using-python-kafka-and-mongodb-

# Data-pipeline-using-python-kafka-and-mongodb-quickstart

This project contains 2 files (besides README.md).

SourceCode.py combines the scraping code with the Kafka producer code that streams the data to MongoDB (the data lake).

The data is scraped in a Jupyter Notebook using Python and the Selenium library.

Job data is pulled from topcv.vn using Python.
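
The streaming half of that flow (scraped job records serialized as JSON and published to a Kafka topic) can be sketched roughly as follows. This is a minimal illustration, not the contents of SourceCode.py: it assumes the kafka-python package, a broker on localhost:9092, and the topic name topcv used later in this README. The record fields in the usage note are hypothetical.

```python
import json


def serialize_job(job: dict) -> bytes:
    """Encode one job record as UTF-8 JSON, the payload sent to Kafka."""
    # ensure_ascii=False keeps Vietnamese text readable in the stored document.
    return json.dumps(job, ensure_ascii=False).encode("utf-8")


def stream_jobs(jobs, topic="topcv", bootstrap_servers="localhost:9092"):
    """Publish each job record to the Kafka topic (needs a running broker)."""
    # Imported lazily so serialize_job works without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=serialize_job,  # applied to every value passed to send()
    )
    try:
        for job in jobs:
            producer.send(topic, value=job)
        producer.flush()  # block until buffered records have been sent
    finally:
        producer.close()
```

With the broker running, a call such as `stream_jobs([{"title": "Data Engineer", "company": "Example Co"}])` would publish one record; the sink connector configured below then copies it into MongoDB.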

Kafka

See the Kafka website for details on the project.

Install Kafka (single node): download Kafka from the official website, https://kafka.apache.org/. The binary build packaged with Scala is recommended; this project uses kafka_2.13-3.2.0.

wget https://dlcdn.apache.org/kafka/3.2.0/kafka_2.13-3.2.0.tgz

See more at https://kafka.apache.org/quickstart

## Connect Kafka to MongoDB

This project uses the mongo-kafka-connect JAR file to configure the MongoDB sink. Crawled data is streamed to Kafka, then from Kafka to the database.

1. Download the JAR file from Git or from the Maven website (https://search.maven.org/artifact/org.mongodb.kafka/mongo-kafka-connect).
2. Copy the JAR and any dependencies into the Kafka plugins directory, which you can specify in plugin.path.
3. Create the file MongoSinkConnector.properties in Kafka's config directory.

Example:

    name=mongo-sink
    topics=topcv
    connector.class=com.mongodb.kafka.connect.MongoSinkConnector
    tasks.max=1
    key.ignore=true
    connection.uri=mongodb://localhost:27017
    database=topcv
    collection=transaction
    max.num.retries=3
    retries.defer.timeout=5000
    type.name=kafka-connect
    schemas.enable=false

Start Kafka (open three terminals):

1. Start the ZooKeeper server
bin/zookeeper-server-start.sh config/zookeeper.properties

2. Start the Kafka server
bin/kafka-server-start.sh config/server.properties

3. Start the connector
bin/connect-standalone.sh config/connect-standalone.properties config/MongoSinkConnector.properties
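
The MongoSinkConnector.properties file passed to connect-standalone.sh in step 3 is plain key=value text, so it can also be generated from a script. The sketch below renders the example settings from this section; the helper names and the output path are illustrative assumptions, not part of this project.

```python
from pathlib import Path

# Example sink settings copied from this README.
SINK_SETTINGS = {
    "name": "mongo-sink",
    "topics": "topcv",
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "tasks.max": "1",
    "key.ignore": "true",
    "connection.uri": "mongodb://localhost:27017",
    "database": "topcv",
    "collection": "transaction",
    "max.num.retries": "3",
    "retries.defer.timeout": "5000",
    "type.name": "kafka-connect",
    "schemas.enable": "false",
}


def render_properties(settings: dict) -> str:
    """Render a dict as .properties text, one key=value pair per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


def write_properties(path, settings=SINK_SETTINGS) -> Path:
    """Write the settings to `path`, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_properties(settings), encoding="utf-8")
    return target
```

Run from the Kafka directory, `write_properties("config/MongoSinkConnector.properties")` would produce the file that step 3 expects.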

Test

1. Open a terminal.

2. Create the topic:
    bin/kafka-topics.sh --create --topic topcv --bootstrap-server localhost:9092

3. Start the Kafka console producer:
    bin/kafka-console-producer.sh --topic topcv --bootstrap-server localhost:9092
Send a message:
    {"hello":"world"}

4. Check the database (MongoDB)
    In another terminal, run:
    mongosh # start the Mongo shell
    show databases # show all databases
    use topcv # switch to the database set in MongoSinkConnector.properties
    show collections # show all collections in the database
    db.transaction.find() # show all records in the collection
    If the message is displayed, the connection works.
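
The same check can be scripted. This is a sketch that assumes the pymongo package (which is not among the libraries this README lists) and the database and collection names from the connector example, topcv and transaction. The helper names are hypothetical.

```python
def find_messages(uri="mongodb://localhost:27017", database="topcv",
                  collection="transaction", limit=5):
    """Return up to `limit` documents written by the sink connector."""
    # Imported lazily so contains_message works without pymongo installed.
    from pymongo import MongoClient

    # Fail after 3 seconds instead of hanging if MongoDB is not reachable.
    with MongoClient(uri, serverSelectionTimeoutMS=3000) as client:
        cursor = client[database][collection].find().limit(limit)
        return list(cursor)


def contains_message(documents, expected: dict) -> bool:
    """True if any document holds every key/value pair in `expected`."""
    return any(
        all(doc.get(key) == value for key, value in expected.items())
        for doc in documents
    )
```

After sending the test message, `contains_message(find_messages(), {"hello": "world"})` should evaluate to True if the connector delivered it.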

Crawl data and stream it through Kafka to MongoDB

Run crawl.py on Windows from cmd. Requirements: install Python and the libraries kafka-python, beautifulsoup4, and selenium, and download chromedriver (check that its version matches your installed Chrome). crawl.py fetches data from http://topcv.vn
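
As a rough illustration of the crawl step, the sketch below loads a page with Selenium and extracts job cards with BeautifulSoup. It is not the contents of crawl.py: the CSS selectors (div.job-item, .title, .company) and the field names are hypothetical placeholders, since the markup of topcv.vn is not described here.

```python
def clean_text(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def fetch_html(url: str) -> str:
    """Load a page in headless Chrome and return the rendered HTML."""
    # Lazy import: requires selenium and a chromedriver matching your Chrome.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


def parse_jobs(html: str) -> list:
    """Extract one dict per job card; the selectors are placeholders."""
    from bs4 import BeautifulSoup  # lazy import: requires beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("div.job-item"):  # hypothetical selector
        title = card.select_one(".title")      # hypothetical selector
        company = card.select_one(".company")  # hypothetical selector
        jobs.append({
            "title": clean_text(title.get_text()) if title else None,
            "company": clean_text(company.get_text()) if company else None,
        })
    return jobs
```

Records returned by `parse_jobs(fetch_html("http://topcv.vn"))` could then be handed to a Kafka producer writing to the topcv topic, once the selectors are replaced with ones that match the real page.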

Repository files (9 commits):

.gitattributes
README.md
SourceCode.py
mongo-kafka-connect-1.7.0-all.jar
