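How to Write a Custom Data Source
In the Data Source introduction article, I introduced Flink Data Sources and briefly touched on custom Data Sources. This article covers custom sources in more detail and builds a demo to make them easier to understand.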
Flink Kafka source
Let's first look at a demo in which Flink reads data from a Kafka topic. You need to have Flink and Kafka installed beforehand.
Start Flink, ZooKeeper, and Kafka.
Good, everything is up and running!
Maven dependencies
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<flink.version>1.10.0</flink.version>
<scala.binary.version>2.11</scala.binary.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- logging -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.7</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
<scope>runtime</scope>
</dependency>
<!--flink kafka connector-->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.11_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>statefun-sdk</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>statefun-flink-harness</artifactId>
<version>2.0.0</version>
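</dependency>
<!-- fastjson: KafkaUtils below calls JSON.toJSONString; the version here is an assumption, use whichever release your project standardizes on -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.83</version>
</dependency>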
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
</dependencies>
Sending test data to a Kafka topic
Entity class, Metric.java:
package com.thinker.model;
import java.util.Map;
/**
* @author zeekling [lingzhaohui@zeekling.cn]
* @version 1.0
* @apiNote Entity class
* @since 2020-05-05
*/
public class Metric {
private String name;
private long timestamp;
private Map<String, Object> fields;
private Map<String, String> tags;
public Metric() {
}
public Metric(String name, long timestamp, Map<String, Object> fields, Map<String, String> tags) {
this.name = name;
this.timestamp = timestamp;
this.fields = fields;
this.tags = tags;
}
@Override
public String toString() {
return "Metric{" +
"name='" + name + '\'' +
", timestamp='" + timestamp + '\'' +
", fields=" + fields +
", tags=" + tags +
'}';
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public long getTimestamp() {
return timestamp;
}
public void setTimestamp(long timestamp) {
this.timestamp = timestamp;
}
public Map<String, Object> getFields() {
return fields;
}
public void setFields(Map<String, Object> fields) {
this.fields = fields;
}
public Map<String, String> getTags() {
return tags;
}
public void setTags(Map<String, String> tags) {
this.tags = tags;
}
}
Utility class that writes data to Kafka: KafkaUtils.java
package com.thinker.util;
import com.thinker.model.Metric;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import com.alibaba.fastjson.JSON;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
/**
* @author zeekling [lingzhaohui@zeekling.cn]
* @version 1.0
* @apiNote Utility class that writes data to Kafka
* @since 2020-05-05
*/
public class KafkaUtils {
public static final String broker_list = "localhost:9092";
public static final String topic = "metric"; // Kafka topic; the Flink program must use the same name
public static void writeToKafka() throws InterruptedException {
Properties props = new Properties();
props.put("bootstrap.servers", broker_list);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer"); // key serializer
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer"); // value serializer
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
Metric metric = new Metric();
metric.setTimestamp(System.currentTimeMillis());
metric.setName("mem");
Map<String, String> tags = new HashMap<>();
Map<String, Object> fields = new HashMap<>();
tags.put("cluster", "zhisheng");
tags.put("host_ip", "101.147.022.106");
fields.put("used_percent", 90d);
fields.put("max", 27244873d);
fields.put("used", 17244873d);
fields.put("init", 27244873d);
metric.setTags(tags);
metric.setFields(fields);
ProducerRecord<String, String> record = new ProducerRecord<>(topic, JSON.toJSONString(metric));
producer.send(record);
System.out.println("Sent data: " + JSON.toJSONString(metric));
producer.flush();
producer.close(); // writeToKafka() creates a new producer on every call, so release it here
}
public static void main(String[] args) throws InterruptedException {
while (true) {
Thread.sleep(300);
writeToKafka();
}
}
}
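Run it:
If the console keeps printing the JSON of each Metric that was sent (highlighted in the original screenshot), data is being written to Kafka continuously.
The Flink program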
Main.java
package com.thinker;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import java.util.Properties;
/**
* @author zeekling [lingzhaohui@zeekling.cn]
* @version 1.0
* @apiNote
* @since 2020-05-05
*/
public class Main {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("zookeeper.connect", "localhost:2181");
props.put("group.id", "metric-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); // key deserializer
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); // value deserializer
props.put("auto.offset.reset", "latest"); // start from the latest offset when no committed offset exists
DataStreamSource<String> dataStreamSource = env.addSource(new FlinkKafkaConsumer011<>(
"metric", // Kafka topic; must match KafkaUtils.topic
new SimpleStringSchema(), // deserialize each record value as a String
props)).setParallelism(1);
dataStreamSource.print(); // print the records read from Kafka to the console
env.execute("Flink add data source");
}
}
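Run it:
As you can see, the Flink program's console prints the incoming data continuously.
Custom Source
That was the Kafka source that ships with Flink. Next, let's follow the same pattern and write a Source that reads data from MySQL.
First, add the MySQL dependency to pom.xml: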
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.34</version>
</dependency>
Create the database table as follows:
DROP TABLE IF EXISTS `student`;
CREATE TABLE `student` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(25) COLLATE utf8_bin DEFAULT NULL,
`password` varchar(25) COLLATE utf8_bin DEFAULT NULL,
`age` int(10) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Insert some data:
INSERT INTO `student`
VALUES
('1', 'zhisheng01', '123456', '18'),
('2', 'zhisheng02', '123', '17'),
('3', 'zhisheng03', '1234', '18'),
('4', 'zhisheng04', '12345', '16');
COMMIT;
Create the entity class: Student.java
package com.thinker.model;
/**
* @author zeekling [lingzhaohui@zeekling.cn]
* @version 1.0
* @apiNote Entity for the student table
* @since 2020-05-05
*/
public class Student {
private int id;
private String name;
private String password;
private int age;
public Student() {
}
public Student(int id, String name, String password, int age) {
this.id = id;
this.name = name;
this.password = password;
this.age = age;
}
@Override
public String toString() {
return "Student{" +
"id=" + id +
", name='" + name + '\'' +
", password='" + password + '\'' +
", age=" + age +
'}';
}
public int getId() {
return id;
}
public void setId(int id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getPassword() {
return password;
}
public void setPassword(String password) {
this.password = password;
}
public int getAge() {
return age;
}
public void setAge(int age) {
this.age = age;
}
}
Create the Source class SourceFromMySQL.java. It extends RichSourceFunction and implements its open, close, run, and cancel methods:
package com.thinker.sql;
import com.thinker.model.Student;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
/**
* @author zeekling [lingzhaohui@zeekling.cn]
* @version 1.0
* @apiNote
* @since 2020-05-05
*/
public class SourceFromMySQL extends RichSourceFunction<Student> {
private PreparedStatement ps;
private Connection connection;
/**
* Open the connection in open() so that it is not created and released on every read.
*
* @param parameters
* @throws Exception
*/
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
connection = getConnection();
String sql = "select * from student"; // table names can be case-sensitive in MySQL, so match the DDL exactly
ps = this.connection.prepareStatement(sql);
}
/**
* Once the job has finished, close the connection and release the resources.
*
* @throws Exception
*/
@Override
public void close() throws Exception {
super.close();
if (ps != null) { // close the statement first, then the connection that owns it
ps.close();
}
if (connection != null) {
connection.close();
}
}
/**
* Flink calls run() once to fetch the data; this source emits every row and then finishes.
*
* @param ctx
* @throws Exception
*/
@Override
public void run(SourceContext<Student> ctx) throws Exception {
try (ResultSet resultSet = ps.executeQuery()) { // the ResultSet is closed once every row has been emitted
while (resultSet.next()) {
Student student = new Student(
resultSet.getInt("id"),
resultSet.getString("name").trim(),
resultSet.getString("password").trim(),
resultSet.getInt("age"));
ctx.collect(student);
}
}
}
@Override
public void cancel() {
}
private static Connection getConnection() throws Exception {
Class.forName("com.mysql.jdbc.Driver");
// Let any failure propagate so that open() fails fast instead of hitting a NullPointerException on a null connection.
return DriverManager.getConnection("jdbc:mysql://localhost:3306/flink_test?useUnicode=true&characterEncoding=UTF-8", "root", "root123456");
}
}
Flink 程序:
package com.thinker.main;
import com.thinker.sql.SourceFromMySQL;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* @author zeekling [lingzhaohui@zeekling.cn]
* @version 1.0
* @apiNote
* @since 2020-05-05
*/
public class FlinkCustomSource {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.addSource(new SourceFromMySQL()).print();
env.execute("Flink add data source");
}
}
Run the Flink program and the student records show up in the console log.
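The source above reads the table once and leaves cancel() empty, which is fine for a bounded read. For a source that keeps polling, a common pattern is a volatile flag that cancel() flips and run() checks. Here is a minimal sketch of that lifecycle in plain Java, without Flink on the classpath. Emitter, PollingSource, fetchBatch, and drive are hypothetical names I made up for illustration (stand-ins for SourceContext, a RichSourceFunction subclass, the JDBC query, and the Flink runtime); none of them are Flink APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CancellableSourceSketch {

    /** Hypothetical stand-in for Flink's SourceContext: receives emitted records. */
    interface Emitter<T> {
        void collect(T record);
    }

    /** Hypothetical polling source with the open/run/cancel/close lifecycle. */
    static class PollingSource {
        // volatile because cancel() is normally called from another thread than run()
        private volatile boolean running = true;
        private final List<String> lifecycle = new ArrayList<String>();
        private int polls = 0;

        void open() {
            lifecycle.add("open"); // a real source would open its connection here
        }

        void run(Emitter<String> out) {
            lifecycle.add("run");
            while (running) {
                for (String row : fetchBatch()) {
                    if (!running) {
                        break; // stop emitting as soon as cancel() has been called
                    }
                    out.collect(row);
                }
            }
        }

        void cancel() {
            running = false;
            lifecycle.add("cancel");
        }

        void close() {
            lifecycle.add("close"); // a real source would release its connection here
        }

        List<String> lifecycle() {
            return lifecycle;
        }

        /** Hypothetical replacement for the JDBC query: two fake rows per poll. */
        private List<String> fetchBatch() {
            polls++;
            return Arrays.asList("row-" + polls + "a", "row-" + polls + "b");
        }
    }

    /** Runs the source and cancels it once maxRecords records have been collected. */
    static List<String> drive(final PollingSource source, final int maxRecords) {
        final List<String> collected = new ArrayList<String>();
        source.open();
        try {
            source.run(new Emitter<String>() {
                @Override
                public void collect(String record) {
                    collected.add(record);
                    if (collected.size() >= maxRecords) {
                        source.cancel();
                    }
                }
            });
        } finally {
            source.close(); // always release resources, even if run() throws
        }
        return collected;
    }

    public static void main(String[] args) {
        PollingSource source = new PollingSource();
        System.out.println(drive(source, 3));   // prints [row-1a, row-1b, row-2a]
        System.out.println(source.lifecycle()); // prints [open, run, cancel, close]
    }
}
```

In this sketch the flag is checked both in the outer loop and before each record, so a cancel request takes effect in the middle of a batch rather than only between polls. A real polling source would normally also sleep between polls instead of spinning.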
RichSourceFunction
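As the custom Source above shows, the class we extended is RichSourceFunction, so let's take a closer look at it:
It is an abstract class that extends AbstractRichFunction and provides the base capabilities for implementing a rich SourceFunction. It has three subclasses: two are abstract classes that build more specific behavior on top of it, and the third is ContinuousFileMonitoringFunction.
- MessageAcknowledgingSourceBase: targets data sources that are message queues and provides an ID-based acknowledgement mechanism.
- MultipleIdsMessageAcknowledgingSourceBase: refines the ID acknowledgement mechanism of MessageAcknowledgingSourceBase and supports two ID acknowledgement models: session id and unique message id.
- ContinuousFileMonitoringFunction: a single (non-parallel) monitoring task that takes a FileInputFormat and, depending on the FileProcessingMode and FilePathFilter, monitors the user-provided path, decides which files should be read and processed further, creates the FileInputSplits corresponding to those files, and assigns them to downstream tasks for further processing.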
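To make the ID-based acknowledgement idea a bit more concrete, here is a rough sketch. The source remembers the IDs of the messages it emitted since the last checkpoint, and only acknowledges them to the queue once that checkpoint is reported complete. This is a simplified plain-Java model of the idea, not Flink's implementation. AckTracker and all of its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AckTrackerSketch {

    /** Hypothetical tracker that groups emitted message IDs by checkpoint. */
    static class AckTracker {

        /** IDs emitted before a checkpoint, tagged with the checkpoint that covers them. */
        private static final class Pending {
            final long checkpointId;
            final List<String> ids;

            Pending(long checkpointId, List<String> ids) {
                this.checkpointId = checkpointId;
                this.ids = ids;
            }
        }

        private final Set<String> seen = new HashSet<String>();
        private final Deque<Pending> pending = new ArrayDeque<Pending>();
        private List<String> currentIds = new ArrayList<String>();

        /** Returns true if the message is new and should be emitted, false for a redelivery. */
        boolean onMessage(String id) {
            if (!seen.add(id)) {
                return false; // already emitted but not yet acknowledged
            }
            currentIds.add(id);
            return true;
        }

        /** Called when a checkpoint is taken: freeze the IDs emitted so far under that checkpoint. */
        void onCheckpoint(long checkpointId) {
            pending.addLast(new Pending(checkpointId, currentIds));
            currentIds = new ArrayList<String>();
        }

        /** Called when a checkpoint completes: returns the IDs that may now be acknowledged. */
        List<String> onCheckpointComplete(long checkpointId) {
            List<String> acked = new ArrayList<String>();
            while (!pending.isEmpty() && pending.peekFirst().checkpointId <= checkpointId) {
                Pending p = pending.pollFirst();
                acked.addAll(p.ids);
                seen.removeAll(p.ids);
            }
            return acked;
        }
    }

    public static void main(String[] args) {
        AckTracker tracker = new AckTracker();
        tracker.onMessage("m1");
        tracker.onMessage("m2");
        tracker.onCheckpoint(1L);
        tracker.onMessage("m3");
        tracker.onCheckpoint(2L);
        System.out.println(tracker.onCheckpointComplete(1L)); // prints [m1, m2]
        System.out.println(tracker.onCheckpointComplete(2L)); // prints [m3]
    }
}
```

Holding back the acknowledgement until the checkpoint completes means that a message whose effects were lost in a failure is still unacknowledged, so the queue can redeliver it.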
Reposted from: http://www.54tianzhisheng.cn/2018/10/30/flink-create-source/
