Skip to content

__cmp__ errors triggered in address_dupe_pairs (dedupe.py) #15

@thisisaaronland

Description

@thisisaaronland

This appears to be a variation on issue #9 meaning it's a type mismatch being triggered in postal/utils/enum.py but I am hoping you can provide some input on how to track down the root cause in order to remedy things.

That or some suggestions for how to trap this sort of thing and drop the non-result, because in the example below the EMR process ran for ~2 hours before finally failing.

The input data should be clean so I am not sure how to debug these errors...

        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python2.7/site-packages/lieu/spark/dedupe.py", line 28, in <lambda>
    .filter(lambda ((uid1, uid2), (address_dupe_status, is_sub_building_dupe)): address_dupe_status in (duplicate_status.EXACT_DUPLICATE, duplicate_status.LIKELY_DUPLICATE) and is_sub_building_dup\
e) \
  File "/usr/local/lib64/python2.7/site-packages/postal/utils/enum.py", line 16, in __cmp__
    return self.value.__cmp__(other)
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'tuple'

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions