Spark JoinWithCassandraTable on TimeStamp 分区键 STUCK

2021-12-31 00:00:00 apache-spark scala mysql cassandra datastax-enterprise

我正在尝试使用以下方法过滤巨大的 C* 表的一小部分:

I'm trying to filter on a small part of a huge C* table by using:

val snapshotsFiltered = sc.parallelize(startDate to endDate).map(TableKey(_)).joinWithCassandraTable("listener","snapshots_tspark") println("Done Join") //******* //get only the snapshots and create rdd temp table val jsons = snapshotsFiltered.map(_._2.getString("snapshot")) val jsonSchemaRDD = sqlContext.jsonRDD(jsons) jsonSchemaRDD.registerTempTable("snapshots_json")

与:

case class TableKey(created: Long) //(created, imei, when)--> created = partititon key | imei, when = clustering key

而 cassandra 表架构是:

And the cassandra table schema is:

CREATE TABLE listener.snapshots_tspark ( created timestamp, imei text, when timestamp, snapshot text, PRIMARY KEY (created, imei, when) ) WITH CLUSTERING ORDER BY (imei ASC, when ASC) AND bloom_filter_fp_chance = 0.01 AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}' AND comment = '' AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND dclocal_read_repair_chance = 0.1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = '99.0PERCENTILE';

问题是在 println 完成后进程冻结，在 spark master ui 上没有错误.

The problem is that the process freezes after the println done with no errors on spark master ui.

[Stage 0:> (0 + 2) / 2]

Join 不能使用时间戳作为分区键吗?为什么会冻结?

Won`t the Join work with timestamp as the partition key? Why it freezes?

推荐答案

通过使用:

sc.parallelize(startDate to endDate)

将 startData 和 endDate 作为 Longs 从日期生成的格式:

With the startData and endDate as Longs generated from Dates by the format:

("yyyy-MM-dd HH:mm:ss")

我使用 spark 来构建一个巨大的数组(100,000 多个对象)以连接 C* 表，它根本没有卡住 - C* 努力使连接发生并返回数据.

I made spark to build a huge array (100,000+ objects) to join with C* table and it didn't stuck at all- C* worked hard to make the join happen and return the data.

最后，我将范围更改为:

Finally, I changed my range to:

case class TableKey(created_dh: String) val data = Array("2015-10-29 12:00:00", "2015-10-29 13:00:00", "2015-10-29 14:00:00", "2015-10-29 15:00:00") val snapshotsFiltered = sc.parallelize(data, 2).map(TableKey(_)).joinWithCassandraTable("listener","snapshots_tnew")

现在好了.

相关文章