at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 174, in main
process()
File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 169, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/lib/python2.7/site-packages/lieu/spark/dedupe.py", line 28, in <lambda>
.filter(lambda ((uid1, uid2), (address_dupe_status, is_sub_building_dupe)): address_dupe_status in (duplicate_status.EXACT_DUPLICATE, duplicate_status.LIKELY_DUPLICATE) and is_sub_building_dupe) \
File "/usr/local/lib64/python2.7/site-packages/postal/utils/enum.py", line 16, in __cmp__
return self.value.__cmp__(other)
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'tuple'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
This appears to be a variation on issue #9, meaning it's a type mismatch being triggered in postal/utils/enum.py, but I am hoping you can provide some input on how to track down the root cause in order to remedy things. That, or some suggestions for how to trap this sort of thing and drop the non-result, because in the example above the EMR process ran for ~2 hours before finally failing.
The input data should be clean so I am not sure how to debug these errors...
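For reference, here is a rough sketch of the kind of trap I have in mind. It is not lieu's actual code. The TypeError suggests that a tuple arrived where a single status value was expected, so the idea is to check the record shape before comparing, and to drop anything that does not fit rather than letting the worker die. The names is_well_formed and is_wanted_dupe, and the integer status values, are hypothetical stand-ins for illustration. The real statuses are the duplicate_status enum values from postal.

```python
# Hypothetical stand-ins for duplicate_status.EXACT_DUPLICATE and
# duplicate_status.LIKELY_DUPLICATE; the real values are postal enums.
EXACT_DUPLICATE = 9
LIKELY_DUPLICATE = 6
WANTED = (EXACT_DUPLICATE, LIKELY_DUPLICATE)


def is_well_formed(record):
    """Return True if record looks like ((uid1, uid2), (status, flag))."""
    try:
        (uid1, uid2), (status, is_sub_building_dupe) = record
    except (TypeError, ValueError):
        # Not a pair of pairs at all.
        return False
    # Assumption based on the traceback: a bad record carries a tuple
    # (or list) in the slot where a single status value should be.
    return not isinstance(status, (tuple, list))


def is_wanted_dupe(record):
    """Filter predicate that drops malformed records instead of raising."""
    if not is_well_formed(record):
        return False
    _, (status, is_sub_building_dupe) = record
    try:
        return status in WANTED and bool(is_sub_building_dupe)
    except TypeError:
        # Any remaining comparison failure is treated as a non-result.
        return False


if __name__ == "__main__":
    good = (("a", "b"), (EXACT_DUPLICATE, True))
    bad = (("a", "b"), ((EXACT_DUPLICATE, 0.9), True))
    print(is_wanted_dupe(good), is_wanted_dupe(bad))
```

A predicate like this could be passed to the RDD's filter() in place of the lambda, and filtering on "not is_well_formed" followed by take(10) on a small sample might show which records are malformed without waiting for a full EMR run.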