Class RegistryJpaIO.Write<T>

  • Type Parameters:
    T - type of the entities to be written
    All Implemented Interfaces:
    java.io.Serializable, org.apache.beam.sdk.transforms.display.HasDisplayData
    Enclosing class:
    RegistryJpaIO

    public abstract static class RegistryJpaIO.Write<T>
    extends org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<T>,​org.apache.beam.sdk.values.PCollection<java.lang.Void>>
    A transform that writes a PCollection of entities to the SQL database using the JpaTransactionManager.

    Unlike typical Beam Write transforms, the output type of this transform is PCollection<Void> instead of PDone. This deviation allows multiple PCollections to be sequenced: we have use cases where one collection of data must be completely written before another can start (due to foreign key constraints in the latter).
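
    The sequencing described above is typically achieved with Beam's Wait.on transform, which holds back one collection until a signal collection is complete. The sketch below illustrates the pattern; the write() factory and the Contact/Domain entity types are illustrative assumptions, not the exact Nomulus API.

    ```java
    // Sketch: sequence two writes so that contacts land before domains,
    // which reference them by foreign key. Entity names and the write()
    // factory are illustrative, not the exact Nomulus API.
    PCollection<Void> contactsDone =
        contacts.apply("WriteContacts", RegistryJpaIO.<Contact>write());

    // Wait.on holds back the domains collection until contactsDone is
    // complete -- possible only because Write outputs PCollection<Void>
    // rather than PDone.
    domains
        .apply("WaitForContacts", Wait.on(contactsDone))
        .apply("WriteDomains", RegistryJpaIO.<Domain>write());
    ```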

    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int DEFAULT_BATCH_SIZE  
      static java.lang.String DEFAULT_NAME  
      static int DEFAULT_SHARDS
      The default number of write shards.
      • Fields inherited from class org.apache.beam.sdk.transforms.PTransform

        name
    • Constructor Summary

      Constructors 
      Constructor Description
      Write()  
    • Method Summary

      Modifier and Type Method Description
      abstract int batchSize()
      Number of elements to be written in one call.
      RegistryJpaIO.Write<T> disableUpdateAutoTimestamp()  
      org.apache.beam.sdk.values.PCollection<java.lang.Void> expand​(org.apache.beam.sdk.values.PCollection<T> input)  
      abstract org.apache.beam.sdk.transforms.SerializableFunction<T,​java.lang.Object> jpaConverter()  
      abstract java.lang.String name()  
      abstract int shards()
      The number of shards the output should be split into.
      RegistryJpaIO.Write<T> withBatchSize​(int batchSize)  
      RegistryJpaIO.Write<T> withJpaConverter​(org.apache.beam.sdk.transforms.SerializableFunction<T,​java.lang.Object> jpaConverter)
      An optional function that converts the input entities to a form that can be written into the database.
      RegistryJpaIO.Write<T> withName​(java.lang.String name)  
      RegistryJpaIO.Write<T> withShards​(int shards)  
      • Methods inherited from class org.apache.beam.sdk.transforms.PTransform

        compose, compose, getAdditionalInputs, getDefaultOutputCoder, getDefaultOutputCoder, getDefaultOutputCoder, getKindString, getName, populateDisplayData, toString, validate
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
    • Constructor Detail

      • Write

        public Write()
    • Method Detail

      • name

        public abstract java.lang.String name()
      • batchSize

        public abstract int batchSize()
        Number of elements to be written in one call.
      • shards

        public abstract int shards()
        The number of shards the output should be split into.

        This value is a hint to the pipeline runner on the level of parallelism, and should be significantly greater than the number of threads working on this transform (see the next paragraph for more information). On the other hand, it should not be so large that the number of elements per shard falls below batchSize(). As a rule of thumb, the following constraint should hold: shards * batchSize * nThreads <= inputElementCount. Although it is not always possible to determine the number of threads working on this transform, when the pipeline run is IO-bound it is most likely close to the total number of threads in the pipeline, which is explained below.

        With Cloud Dataflow runner, the total number of worker threads in a batch pipeline (which includes all existing Registry pipelines) is the number of vCPUs used by the pipeline, and can be set by the --maxNumWorkers and --workerMachineType parameters. The number of worker threads in a streaming pipeline can be set by the --maxNumWorkers and --numberOfWorkerHarnessThreads parameters.

        Note that connections on the database server are a limited resource, so the number of threads that interact with the database should be kept to an appropriate limit. Again, we cannot control this number directly, but can influence it by controlling the total number of threads in the pipeline.
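
        The rule of thumb above can be checked with simple arithmetic. The sketch below computes the largest shard count satisfying shards * batchSize * nThreads <= inputElementCount; all numbers are hypothetical.

        ```java
        // Sketch: pick a shard count satisfying the rule of thumb
        // shards * batchSize * nThreads <= inputElementCount.
        // All numbers here are hypothetical.
        public class ShardHint {
          static int maxShards(long inputElementCount, int batchSize, int nThreads) {
            // Largest shards value whose product with batchSize and nThreads
            // does not exceed the input size; never less than 1.
            return (int) Math.max(1, inputElementCount / ((long) batchSize * nThreads));
          }

          public static void main(String[] args) {
            // 1,000,000 elements, batches of 64, ~32 worker threads.
            System.out.println(maxShards(1_000_000, 64, 32)); // prints 488
          }
        }
        ```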

      • jpaConverter

        public abstract org.apache.beam.sdk.transforms.SerializableFunction<T,​java.lang.Object> jpaConverter()
      • withJpaConverter

        public RegistryJpaIO.Write<T> withJpaConverter​(org.apache.beam.sdk.transforms.SerializableFunction<T,​java.lang.Object> jpaConverter)
        An optional function that converts the input entities to a form that can be written into the database.
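
        As a sketch of how the converter fits into the builder chain, the snippet below maps a raw input record to a JPA-mapped entity before writing. The write() factory and the HistoryRecord/HistoryEntity types are illustrative assumptions, not part of the actual Nomulus schema.

        ```java
        // Sketch: configure a Write that converts each input record into a
        // JPA entity before persisting. Type names are illustrative.
        RegistryJpaIO.Write<HistoryRecord> write =
            RegistryJpaIO.<HistoryRecord>write()
                .withName("WriteHistory")
                .withBatchSize(64)
                .withShards(10)
                .withJpaConverter(
                    record -> new HistoryEntity(record.id(), record.payload()));
        ```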
      • expand

        public org.apache.beam.sdk.values.PCollection<java.lang.Void> expand​(org.apache.beam.sdk.values.PCollection<T> input)
        Specified by:
        expand in class org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<T>,​org.apache.beam.sdk.values.PCollection<java.lang.Void>>