first implementation of bench.c with new API ZSTD_compress_generic()

Doesn't speed optimize this buffer-to-buffer scenario yet.
Still internally defers to streaming implementation.

Also : fixed a long standing bug in ZSTDMT streaming API.
diff --git a/doc/zstd_manual.html b/doc/zstd_manual.html
index cf51daa..88e8907 100644
--- a/doc/zstd_manual.html
+++ b/doc/zstd_manual.html
@@ -13,7 +13,7 @@
 <li><a href="#Chapter3">Simple API</a></li>
 <li><a href="#Chapter4">Explicit memory management</a></li>
 <li><a href="#Chapter5">Simple dictionary API</a></li>
-<li><a href="#Chapter6">Fast dictionary API</a></li>
+<li><a href="#Chapter6">Bulk processing dictionary API</a></li>
 <li><a href="#Chapter7">Streaming</a></li>
 <li><a href="#Chapter8">Streaming compression - HowTo</a></li>
 <li><a href="#Chapter9">Streaming decompression - HowTo</a></li>
@@ -40,18 +40,18 @@
     - a single step (described as Simple API)
     - a single step, reusing a context (described as Explicit memory management)
     - unbounded multiple steps (described as Streaming compression)
-  The compression ratio achievable on small data can be highly improved using compression with a dictionary in:
+  The compression ratio achievable on small data can be highly improved using a dictionary in:
     - a single step (described as Simple dictionary API)
     - a single step, reusing a dictionary (described as Fast dictionary API)
 
   Advanced experimental functions can be accessed using #define ZSTD_STATIC_LINKING_ONLY before including zstd.h.
-  These APIs shall never be used with a dynamic library.
+  Advanced experimental APIs shall never be used with a dynamic library.
   They are not "stable", their definition may change in the future. Only static linking is allowed.
 <BR></pre>
 
 <a name="Chapter2"></a><h2>Version</h2><pre></pre>
 
-<pre><b>unsigned ZSTD_versionNumber(void);   </b>/**< to be used when checking dll version */<b>
+<pre><b>unsigned ZSTD_versionNumber(void);   </b>/**< useful to check dll version */<b>
 </b></pre><BR>
 <a name="Chapter3"></a><h2>Simple API</h2><pre></pre>
 
@@ -67,7 +67,7 @@
 <pre><b>size_t ZSTD_decompress( void* dst, size_t dstCapacity,
                   const void* src, size_t compressedSize);
 </b><p>  `compressedSize` : must be the _exact_ size of some number of compressed and/or skippable frames.
-  `dstCapacity` is an upper bound of originalSize.
+  `dstCapacity` is an upper bound of originalSize to regenerate.
   If user cannot imply a maximum upper bound, it's better to use streaming mode to decompress data.
   @return : the number of bytes decompressed into `dst` (<= `dstCapacity`),
             or an errorCode if it fails (which can be tested using ZSTD_isError()). 
@@ -140,33 +140,33 @@
                          const void* src, size_t srcSize,
                          const void* dict,size_t dictSize,
                                int compressionLevel);
-</b><p>   Compression using a predefined Dictionary (see dictBuilder/zdict.h).
-   Note : This function loads the dictionary, resulting in significant startup delay.
-   Note : When `dict == NULL || dictSize < 8` no dictionary is used. 
+</b><p>  Compression using a predefined Dictionary (see dictBuilder/zdict.h).
+  Note : This function loads the dictionary, resulting in significant startup delay.
+  Note : When `dict == NULL || dictSize < 8` no dictionary is used. 
 </p></pre><BR>
 
 <pre><b>size_t ZSTD_decompress_usingDict(ZSTD_DCtx* dctx,
                                  void* dst, size_t dstCapacity,
                            const void* src, size_t srcSize,
                            const void* dict,size_t dictSize);
-</b><p>   Decompression using a predefined Dictionary (see dictBuilder/zdict.h).
-   Dictionary must be identical to the one used during compression.
-   Note : This function loads the dictionary, resulting in significant startup delay.
-   Note : When `dict == NULL || dictSize < 8` no dictionary is used. 
+</b><p>  Decompression using a predefined Dictionary (see dictBuilder/zdict.h).
+  Dictionary must be identical to the one used during compression.
+  Note : This function loads the dictionary, resulting in significant startup delay.
+  Note : When `dict == NULL || dictSize < 8` no dictionary is used. 
 </p></pre><BR>
 
-<a name="Chapter6"></a><h2>Fast dictionary API</h2><pre></pre>
+<a name="Chapter6"></a><h2>Bulk processing dictionary API</h2><pre></pre>
 
 <pre><b>ZSTD_CDict* ZSTD_createCDict(const void* dictBuffer, size_t dictSize,
                              int compressionLevel);
-</b><p>   When compressing multiple messages / blocks with the same dictionary, it's recommended to load it just once.
-   ZSTD_createCDict() will create a digested dictionary, ready to start future compression operations without startup delay.
-   ZSTD_CDict can be created once and used by multiple threads concurrently, as its usage is read-only.
-   `dictBuffer` can be released after ZSTD_CDict creation, as its content is copied within CDict 
+</b><p>  When compressing multiple messages / blocks with the same dictionary, it's recommended to load it just once.
+  ZSTD_createCDict() will create a digested dictionary, ready to start future compression operations without startup delay.
+  ZSTD_CDict can be created once and shared by multiple threads concurrently, since its usage is read-only.
+  `dictBuffer` can be released after ZSTD_CDict creation, since its content is copied within CDict 
 </p></pre><BR>
 
 <pre><b>size_t      ZSTD_freeCDict(ZSTD_CDict* CDict);
-</b><p>   Function frees memory allocated by ZSTD_createCDict(). 
+</b><p>  Function frees memory allocated by ZSTD_createCDict(). 
 </p></pre><BR>
 
 <pre><b>size_t ZSTD_compress_usingCDict(ZSTD_CCtx* cctx,
@@ -180,20 +180,20 @@
 </p></pre><BR>
 
 <pre><b>ZSTD_DDict* ZSTD_createDDict(const void* dictBuffer, size_t dictSize);
-</b><p>   Create a digested dictionary, ready to start decompression operation without startup delay.
-   dictBuffer can be released after DDict creation, as its content is copied inside DDict 
+</b><p>  Create a digested dictionary, ready to start decompression operation without startup delay.
+  dictBuffer can be released after DDict creation, as its content is copied inside DDict 
 </p></pre><BR>
 
 <pre><b>size_t      ZSTD_freeDDict(ZSTD_DDict* ddict);
-</b><p>   Function frees memory allocated with ZSTD_createDDict() 
+</b><p>  Function frees memory allocated with ZSTD_createDDict() 
 </p></pre><BR>
 
 <pre><b>size_t ZSTD_decompress_usingDDict(ZSTD_DCtx* dctx,
                                   void* dst, size_t dstCapacity,
                             const void* src, size_t srcSize,
                             const ZSTD_DDict* ddict);
-</b><p>   Decompression using a digested Dictionary.
-   Faster startup than ZSTD_decompress_usingDict(), recommended when same dictionary is used multiple times. 
+</b><p>  Decompression using a digested Dictionary.
+  Faster startup than ZSTD_decompress_usingDict(), recommended when same dictionary is used multiple times. 
 </p></pre><BR>
 
 <a name="Chapter7"></a><h2>Streaming</h2><pre></pre>
@@ -250,7 +250,7 @@
  
 <BR></pre>
 
-<pre><b>typedef ZSTD_CCtx ZSTD_CStream;  </b>/**< CCtx and CStream are effectively same object */<b>
+<pre><b>typedef ZSTD_CCtx ZSTD_CStream;  </b>/**< CCtx and CStream are now effectively same object (>= v1.3.0) */<b>
 </b></pre><BR>
 <h3>ZSTD_CStream management functions</h3><pre></pre><b><pre>ZSTD_CStream* ZSTD_createCStream(void);
 size_t ZSTD_freeCStream(ZSTD_CStream* zcs);
@@ -285,6 +285,8 @@
  
 <BR></pre>
 
+<pre><b>typedef ZSTD_DCtx ZSTD_DStream;  </b>/**< DCtx and DStream are now effectively same object (>= v1.3.0) */<b>
+</b></pre><BR>
 <h3>ZSTD_DStream management functions</h3><pre></pre><b><pre>ZSTD_DStream* ZSTD_createDStream(void);
 size_t ZSTD_freeDStream(ZSTD_DStream* zds);
 </pre></b><BR>
diff --git a/lib/compress/zstd_compress.c b/lib/compress/zstd_compress.c
index 49d29dc..f5ac75a 100644
--- a/lib/compress/zstd_compress.c
+++ b/lib/compress/zstd_compress.c
@@ -571,7 +571,7 @@
 }
 
 /*! ZSTD_continueCCtx() :
-    reuse CCtx without reset (note : requires no dictionary) */
+ *  reuse CCtx without reset (note : requires no dictionary) */
 static size_t ZSTD_continueCCtx(ZSTD_CCtx* cctx, ZSTD_parameters params, U64 pledgedSrcSize)
 {
     U32 const end = (U32)(cctx->nextSrc - cctx->base);
@@ -3831,6 +3831,7 @@
 
 #ifdef ZSTD_MULTITHREAD
         if (cctx->nbThreads > 1) {
+            DEBUGLOG(4, "call ZSTDMT_initCStream_internal");
             CHECK_F( ZSTDMT_initCStream_internal(cctx->mtctx, NULL, 0, cctx->cdict, params, cctx->pledgedSrcSizePlusOne-1) );
             cctx->streamStage = zcss_load;
         } else
@@ -3842,7 +3843,7 @@
 #ifdef ZSTD_MULTITHREAD
     if (cctx->nbThreads > 1) {
         size_t const flushMin = ZSTDMT_compressStream_generic(cctx->mtctx, output, input, endOp);
-        DEBUGLOG(5, "ZSTDMT result : %u", (U32)flushMin);
+        DEBUGLOG(4, "ZSTDMT_compressStream_generic : %u", (U32)flushMin);
         if ( ZSTD_isError(flushMin)
           || (endOp == ZSTD_e_end && flushMin == 0) ) { /* compression completed */
             ZSTD_startNewCompression(cctx);
diff --git a/lib/compress/zstdmt_compress.c b/lib/compress/zstdmt_compress.c
index 42e31ca..895d362 100644
--- a/lib/compress/zstdmt_compress.c
+++ b/lib/compress/zstdmt_compress.c
@@ -190,7 +190,7 @@
     cctxPool->availCCtx = 1;   /* at least one cctx for single-thread mode */
     cctxPool->cctx[0] = ZSTD_createCCtx_advanced(cMem);
     if (!cctxPool->cctx[0]) { ZSTDMT_freeCCtxPool(cctxPool); return NULL; }
-    DEBUGLOG(3, "cctxPool created, with %u threads", nbThreads);
+    DEBUGLOG(4, "cctxPool created, with %u threads", nbThreads);
     return cctxPool;
 }
 
@@ -261,11 +261,11 @@
     ZSTDMT_jobDescription* const job = (ZSTDMT_jobDescription*)jobDescription;
     const void* const src = (const char*)job->srcStart + job->dictSize;
     buffer_t const dstBuff = job->dstBuff;
-    DEBUGLOG(3, "job (first:%u) (last:%u) : dictSize %u, srcSize %u",
+    DEBUGLOG(4, "job (first:%u) (last:%u) : dictSize %u, srcSize %u",
                  job->firstChunk, job->lastChunk, (U32)job->dictSize, (U32)job->srcSize);
     if (job->cdict) {  /* should only happen for first segment */
         size_t const initError = ZSTD_compressBegin_usingCDict_advanced(job->cctx, job->cdict, job->params.fParams, job->fullFrameSize);
-        DEBUGLOG(3, "using CDict");
+        DEBUGLOG(5, "using CDict");
         if (ZSTD_isError(initError)) { job->cSize = initError; goto _endJob; }
     } else {  /* srcStart points at reloaded section */
         if (!job->firstChunk) job->params.fParams.contentSizeFlag = 0;  /* ensure no srcSize control */
@@ -280,12 +280,12 @@
         ZSTD_invalidateRepCodes(job->cctx);
     }
 
-    DEBUGLOG(4, "Compressing : ");
+    DEBUGLOG(5, "Compressing : ");
     DEBUG_PRINTHEX(4, job->srcStart, 12);
     job->cSize = (job->lastChunk) ?
                  ZSTD_compressEnd     (job->cctx, dstBuff.start, dstBuff.size, src, job->srcSize) :
                  ZSTD_compressContinue(job->cctx, dstBuff.start, dstBuff.size, src, job->srcSize);
-    DEBUGLOG(3, "compressed %u bytes into %u bytes   (first:%u) (last:%u)",
+    DEBUGLOG(4, "compressed %u bytes into %u bytes   (first:%u) (last:%u)",
                 (unsigned)job->srcSize, (unsigned)job->cSize, job->firstChunk, job->lastChunk);
     DEBUGLOG(5, "dstBuff.size : %u ; => %s", (U32)dstBuff.size, ZSTD_getErrorName(job->cSize));
 
@@ -426,7 +426,7 @@
         mtctx->sectionSize = value;
         return 0;
     case ZSTDMT_p_overlapSectionLog :
-        DEBUGLOG(4, "ZSTDMT_p_overlapSectionLog : %u", value);
+        DEBUGLOG(5, "ZSTDMT_p_overlapSectionLog : %u", value);
         mtctx->overlapRLog = (value >= 9) ? 0 : 9 - value;
         return 0;
     default :
@@ -457,8 +457,8 @@
     unsigned const compressWithinDst = (dstCapacity >= ZSTD_compressBound(srcSize)) ? nbChunks : (unsigned)(dstCapacity / ZSTD_compressBound(avgChunkSize));  /* presumes avgChunkSize >= 256 KB, which should be the case */
     size_t frameStartPos = 0, dstBufferPos = 0;
 
-    DEBUGLOG(3, "windowLog : %2u => chunkTargetSize : %u bytes  ", params.cParams.windowLog, (U32)chunkTargetSize);
-    DEBUGLOG(3, "nbChunks  : %2u   (chunkSize : %u bytes)   ", nbChunks, (U32)avgChunkSize);
+    DEBUGLOG(4, "windowLog : %2u => chunkTargetSize : %u bytes  ", params.cParams.windowLog, (U32)chunkTargetSize);
+    DEBUGLOG(4, "nbChunks  : %2u   (chunkSize : %u bytes)   ", nbChunks, (U32)avgChunkSize);
     params.fParams.contentSizeFlag = 1;
 
     if (nbChunks==1) {   /* fallback to single-thread mode */
@@ -495,8 +495,8 @@
             mtctx->jobs[u].jobCompleted_mutex = &mtctx->jobCompleted_mutex;
             mtctx->jobs[u].jobCompleted_cond = &mtctx->jobCompleted_cond;
 
-            DEBUGLOG(3, "posting job %u   (%u bytes)", u, (U32)chunkSize);
-            DEBUG_PRINTHEX(3, mtctx->jobs[u].srcStart, 12);
+            DEBUGLOG(4, "posting job %u   (%u bytes)", u, (U32)chunkSize);
+            DEBUG_PRINTHEX(5, mtctx->jobs[u].srcStart, 12);
             POOL_add(mtctx->factory, ZSTDMT_compressChunk, &mtctx->jobs[u]);
 
             frameStartPos += chunkSize;
@@ -508,14 +508,14 @@
     {   unsigned chunkID;
         size_t error = 0, dstPos = 0;
         for (chunkID=0; chunkID<nbChunks; chunkID++) {
-            DEBUGLOG(3, "waiting for chunk %u ", chunkID);
+            DEBUGLOG(5, "waiting for chunk %u ", chunkID);
             PTHREAD_MUTEX_LOCK(&mtctx->jobCompleted_mutex);
             while (mtctx->jobs[chunkID].jobCompleted==0) {
-                DEBUGLOG(4, "waiting for jobCompleted signal from chunk %u", chunkID);
+                DEBUGLOG(5, "waiting for jobCompleted signal from chunk %u", chunkID);
                 pthread_cond_wait(&mtctx->jobCompleted_cond, &mtctx->jobCompleted_mutex);
             }
             pthread_mutex_unlock(&mtctx->jobCompleted_mutex);
-            DEBUGLOG(3, "ready to write chunk %u ", chunkID);
+            DEBUGLOG(5, "ready to write chunk %u ", chunkID);
 
             ZSTDMT_releaseCCtx(mtctx->cctxPool, mtctx->jobs[chunkID].cctx);
             mtctx->jobs[chunkID].cctx = NULL;
@@ -533,7 +533,7 @@
                 dstPos += cSize ;
             }
         }
-        if (!error) DEBUGLOG(3, "compressed size : %u  ", (U32)dstPos);
+        if (!error) DEBUGLOG(4, "compressed size : %u  ", (U32)dstPos);
         return error ? error : dstPos;
     }
 
@@ -550,7 +550,7 @@
         unsigned const jobID = zcs->doneJobID & zcs->jobIDMask;
         PTHREAD_MUTEX_LOCK(&zcs->jobCompleted_mutex);
         while (zcs->jobs[jobID].jobCompleted==0) {
-            DEBUGLOG(4, "waiting for jobCompleted signal from chunk %u", zcs->doneJobID);   /* we want to block when waiting for data to flush */
+            DEBUGLOG(5, "waiting for jobCompleted signal from chunk %u", zcs->doneJobID);   /* we want to block when waiting for data to flush */
             pthread_cond_wait(&zcs->jobCompleted_cond, &zcs->jobCompleted_mutex);
         }
         pthread_mutex_unlock(&zcs->jobCompleted_mutex);
@@ -593,12 +593,12 @@
     }
 
     zcs->targetDictSize = (zcs->overlapRLog>=9) ? 0 : (size_t)1 << (zcs->params.cParams.windowLog - zcs->overlapRLog);
-    DEBUGLOG(4, "overlapRLog : %u ", zcs->overlapRLog);
-    DEBUGLOG(3, "overlap Size : %u KB", (U32)(zcs->targetDictSize>>10));
+    DEBUGLOG(5, "overlapRLog : %u ", zcs->overlapRLog);
+    DEBUGLOG(5, "overlap Size : %u KB", (U32)(zcs->targetDictSize>>10));
     zcs->targetSectionSize = zcs->sectionSize ? zcs->sectionSize : (size_t)1 << (zcs->params.cParams.windowLog + 2);
     zcs->targetSectionSize = MAX(ZSTDMT_SECTION_SIZE_MIN, zcs->targetSectionSize);
     zcs->targetSectionSize = MAX(zcs->targetDictSize, zcs->targetSectionSize);
-    DEBUGLOG(3, "Section Size : %u KB", (U32)(zcs->targetSectionSize>>10));
+    DEBUGLOG(5, "Section Size : %u KB", (U32)(zcs->targetSectionSize>>10));
     zcs->marginSize = zcs->targetSectionSize >> 2;
     zcs->inBuffSize = zcs->targetDictSize + zcs->targetSectionSize + zcs->marginSize;
     zcs->inBuff.buffer = ZSTDMT_getBuffer(zcs->buffPool, zcs->inBuffSize);
@@ -628,7 +628,8 @@
     ZSTD_parameters params = ZSTD_getParamsFromCDict(cdict);
     if (cdict==NULL) return ERROR(GENERIC);   /* method incompatible with NULL cdict */
     params.fParams = fParams;
-    return ZSTDMT_initCStream_internal(mtctx, NULL, 0, cdict, params, pledgedSrcSize);
+    return ZSTDMT_initCStream_internal(mtctx, NULL, 0 /*dictSize*/, cdict,
+                                        params, pledgedSrcSize);
 }
 
 
@@ -686,6 +687,7 @@
     /* get a new buffer for next input */
     if (!endFrame) {
         size_t const newDictSize = MIN(srcSize + zcs->dictSize, zcs->targetDictSize);
+        DEBUGLOG(5, "ZSTDMT_createCompressionJob::endFrame = %u", endFrame);
         zcs->inBuff.buffer = ZSTDMT_getBuffer(zcs->buffPool, zcs->inBuffSize);
         if (zcs->inBuff.buffer.start == NULL) {   /* not enough memory to allocate next input buffer */
             zcs->jobs[jobID].jobCompleted = 1;
@@ -694,9 +696,9 @@
             ZSTDMT_releaseAllJobResources(zcs);
             return ERROR(memory_allocation);
         }
-        DEBUGLOG(5, "inBuff filled to %u", (U32)zcs->inBuff.filled);
+        DEBUGLOG(5, "inBuff currently filled to %u", (U32)zcs->inBuff.filled);
         zcs->inBuff.filled -= srcSize + zcs->dictSize - newDictSize;
-        DEBUGLOG(5, "new job : filled to %u, with %u dict and %u src",
+        DEBUGLOG(5, "new job : inBuff filled to %u, with %u dict and %u src",
                     (U32)zcs->inBuff.filled, (U32)newDictSize,
                     (U32)(zcs->inBuff.filled - newDictSize));
         memmove(zcs->inBuff.buffer.start,
@@ -705,6 +707,7 @@
         DEBUGLOG(5, "new inBuff pre-filled");
         zcs->dictSize = newDictSize;
     } else {   /* if (endFrame==1) */
+        DEBUGLOG(5, "ZSTDMT_createCompressionJob::endFrame = %u", endFrame);
         zcs->inBuff.buffer = g_nullBuffer;
         zcs->inBuff.filled = 0;
         zcs->dictSize = 0;
@@ -714,7 +717,7 @@
             zcs->params.fParams.checksumFlag = 0;
     }
 
-    DEBUGLOG(3, "posting job %u : %u bytes  (end:%u) (note : doneJob = %u=>%u)",
+    DEBUGLOG(4, "posting job %u : %u bytes  (end:%u) (note : doneJob = %u=>%u)",
                 zcs->nextJobID,
                 (U32)zcs->jobs[jobID].srcSize,
                 zcs->jobs[jobID].lastChunk,
@@ -757,7 +760,7 @@
                 XXH64_update(&zcs->xxhState, (const char*)job.srcStart + job.dictSize, job.srcSize);
                 if (zcs->frameEnded && (zcs->doneJobID+1 == zcs->nextJobID)) {  /* write checksum at end of last section */
                     U32 const checksum = (U32)XXH64_digest(&zcs->xxhState);
-                    DEBUGLOG(4, "writing checksum : %08X \n", checksum);
+                    DEBUGLOG(5, "writing checksum : %08X \n", checksum);
                     MEM_writeLE32((char*)job.dstBuff.start + job.cSize, checksum);
                     job.cSize += 4;
                     zcs->jobs[wJobID].cSize += 4;
@@ -792,23 +795,25 @@
 size_t ZSTDMT_compressStream(ZSTDMT_CCtx* zcs, ZSTD_outBuffer* output, ZSTD_inBuffer* input)
 {
     size_t const newJobThreshold = zcs->dictSize + zcs->targetSectionSize + zcs->marginSize;
-    if (zcs->frameEnded)
-        /* current frame being ended. Only flush is allowed. Restart with init */
+    if (zcs->frameEnded) {
+        /* current frame being ended. Only flush is allowed. Or start new job with init */
+        DEBUGLOG(5, "ZSTDMT_compressStream: zcs::frameEnded==1");
         return ERROR(stage_wrong);
+    }
     if (zcs->nbThreads==1) {
         return ZSTD_compressStream(zcs->cctxPool->cctx[0], output, input);
     }
 
     /* fill input buffer */
     {   size_t const toLoad = MIN(input->size - input->pos, zcs->inBuffSize - zcs->inBuff.filled);
-        memcpy((char*)zcs->inBuff.buffer.start + zcs->inBuff.filled, input->src, toLoad);
+        memcpy((char*)zcs->inBuff.buffer.start + zcs->inBuff.filled, (const char*)input->src + input->pos, toLoad);
         input->pos += toLoad;
         zcs->inBuff.filled += toLoad;
     }
 
     if ( (zcs->inBuff.filled >= newJobThreshold)  /* filled enough : let's compress */
         && (zcs->nextJobID <= zcs->doneJobID + zcs->jobIDMask) ) {   /* avoid overwriting job round buffer */
-        CHECK_F( ZSTDMT_createCompressionJob(zcs, zcs->targetSectionSize, 0 /* blockToFlush */) );
+        CHECK_F( ZSTDMT_createCompressionJob(zcs, zcs->targetSectionSize, 0 /* endFrame */) );
     }
 
     /* check for data to flush */
@@ -824,7 +829,8 @@
     size_t const srcSize = zcs->inBuff.filled - zcs->dictSize;
 
     if (srcSize)
-        DEBUGLOG(5, "flushing : %u bytes left to compress", (U32)srcSize);
+        DEBUGLOG(5, "ZSTDMT_flushStream_internal : %u bytes left to compress",
+                    (U32)srcSize);
     if ( ((srcSize > 0) || (endFrame && !zcs->frameEnded))
        && (zcs->nextJobID <= zcs->doneJobID + zcs->jobIDMask) ) {
         DEBUGLOG(5, "create new job with %u bytes to compress", (U32)srcSize);
@@ -842,6 +848,7 @@
 
 size_t ZSTDMT_flushStream(ZSTDMT_CCtx* zcs, ZSTD_outBuffer* output)
 {
+    DEBUGLOG(5, "ZSTDMT_flushStream");
     if (zcs->nbThreads==1)
         return ZSTD_flushStream(zcs->cctxPool->cctx[0], output);
     return ZSTDMT_flushStream_internal(zcs, output, 0 /* endFrame */);
@@ -849,6 +856,7 @@
 
 size_t ZSTDMT_endStream(ZSTDMT_CCtx* zcs, ZSTD_outBuffer* output)
 {
+    DEBUGLOG(5, "ZSTDMT_endStream");
     if (zcs->nbThreads==1)
         return ZSTD_endStream(zcs->cctxPool->cctx[0], output);
     return ZSTDMT_flushStream_internal(zcs, output, 1 /* endFrame */);
@@ -859,17 +867,21 @@
                                      ZSTD_inBuffer* input,
                                      ZSTD_EndDirective endOp)
 {
+    DEBUGLOG(5, "in: pos:%u / size:%u ; endOp=%u",
+                (U32)input->pos, (U32)input->size, (U32)endOp);
     if (input->pos < input->size)  /* exclude final flushes */
         CHECK_F(ZSTDMT_compressStream(mtctx, output, input));
+    if (input->pos < input->size) endOp = ZSTD_e_continue;
     switch(endOp)
     {
         case ZSTD_e_flush:
             return ZSTDMT_flushStream(mtctx, output);
         case ZSTD_e_end:
+            DEBUGLOG(5, "endOp:%u; calling ZSTDMT_endStream", (U32)endOp);
             return ZSTDMT_endStream(mtctx, output);
         case ZSTD_e_continue:
             return 1;
         default:
-            return ERROR(GENERIC);
+            return ERROR(GENERIC);   /* invalid endDirective */
     }
 }
diff --git a/programs/bench.c b/programs/bench.c
index 1f1b005..6f09a56 100644
--- a/programs/bench.c
+++ b/programs/bench.c
@@ -158,7 +158,6 @@
                         const ZSTD_compressionParameters* comprParams)
 {
     size_t const blockSize = ((g_blockSize>=32 && !g_decodeOnly) ? g_blockSize : srcSize) + (!srcSize) /* avoid div by 0 */ ;
-    size_t const avgSize = MIN(blockSize, (srcSize / nbFiles));
     U32 const maxNbBlocks = (U32) ((srcSize + (blockSize-1)) / blockSize) + nbFiles;
     blockParam_t* const blockTable = (blockParam_t*) malloc(maxNbBlocks * sizeof(blockParam_t));
     size_t const maxCompressedSize = ZSTD_compressBound(srcSize) + (maxNbBlocks * 1024);   /* add some room for safety */
@@ -261,42 +260,72 @@
                 UTIL_getTime(&clockStart);
 
                 if (!cCompleted) {   /* still some time to do compression tests */
-                    ZSTD_customMem const cmem = { NULL, NULL, NULL };
                     U64 const clockLoop = g_nbSeconds ? TIMELOOP_MICROSEC : 1;
                     U32 nbLoops = 0;
+                    ZSTD_CDict* cdict = NULL;
+#ifdef ZSTD_NEWAPI
+                    ZSTD_CCtx_setParameter(ctx, ZSTD_p_nbThreads, g_nbThreads);
+                    ZSTD_CCtx_setParameter(ctx, ZSTD_p_compressionLevel, cLevel);
+                    ZSTD_CCtx_setParameter(ctx, ZSTD_p_windowLog, comprParams->windowLog);
+                    ZSTD_CCtx_setParameter(ctx, ZSTD_p_chainLog, comprParams->chainLog);
+                    ZSTD_CCtx_setParameter(ctx, ZSTD_p_searchLog, comprParams->searchLog);
+                    ZSTD_CCtx_setParameter(ctx, ZSTD_p_minMatch, comprParams->searchLength);
+                    ZSTD_CCtx_setParameter(ctx, ZSTD_p_targetLength, comprParams->targetLength);
+                    ZSTD_CCtx_setParameter(ctx, ZSTD_p_compressionStrategy, comprParams->strategy);
+                    ZSTD_CCtx_loadDictionary(ctx, dictBuffer, dictBufferSize);
+#else
+                    size_t const avgSize = MIN(blockSize, (srcSize / nbFiles));
                     ZSTD_parameters zparams = ZSTD_getParams(cLevel, avgSize, dictBufferSize);
-                    ZSTD_CDict* cdict;
+                    ZSTD_customMem const cmem = { NULL, NULL, NULL };
                     if (comprParams->windowLog) zparams.cParams.windowLog = comprParams->windowLog;
                     if (comprParams->chainLog) zparams.cParams.chainLog = comprParams->chainLog;
                     if (comprParams->hashLog) zparams.cParams.hashLog = comprParams->hashLog;
                     if (comprParams->searchLog) zparams.cParams.searchLog = comprParams->searchLog;
                     if (comprParams->searchLength) zparams.cParams.searchLength = comprParams->searchLength;
                     if (comprParams->targetLength) zparams.cParams.targetLength = comprParams->targetLength;
-                    if (comprParams->strategy) zparams.cParams.strategy = (ZSTD_strategy)(comprParams->strategy - 1);
+                    if (comprParams->strategy) zparams.cParams.strategy = comprParams->strategy;
                     cdict = ZSTD_createCDict_advanced(dictBuffer, dictBufferSize, 1, zparams.cParams, cmem);
                     if (cdict==NULL) EXM_THROW(1, "ZSTD_createCDict_advanced() allocation failure");
+#endif
                     do {
                         U32 blockNb;
-                        size_t rSize;
                         for (blockNb=0; blockNb<nbBlocks; blockNb++) {
+                            size_t rSize;
+#ifdef ZSTD_NEWAPI
+                            ZSTD_outBuffer out = { blockTable[blockNb].cPtr,  blockTable[blockNb].cRoom, 0 };
+                            ZSTD_inBuffer in = { blockTable[blockNb].srcPtr,  blockTable[blockNb].srcSize, 0 };
+                            size_t cError = 1;
+                            while (cError) {
+                                cError = ZSTD_compress_generic(ctx,
+                                                    &out, &in, ZSTD_e_end);
+                                if (ZSTD_isError(cError))
+                                    EXM_THROW(1, "ZSTD_compress_generic() error : %s",
+                                                ZSTD_getErrorName(cError));
+                            }
+                            rSize = out.pos;
+#else  /* ! ZSTD_NEWAPI */
                             if (dictBufferSize) {
                                 rSize = ZSTD_compress_usingCDict(ctx,
                                                 blockTable[blockNb].cPtr,  blockTable[blockNb].cRoom,
                                                 blockTable[blockNb].srcPtr,blockTable[blockNb].srcSize,
                                                 cdict);
                             } else {
-#ifdef ZSTD_MULTITHREAD         /* note : limitation : MT single-pass does not support compression with dictionary */
+#  ifdef ZSTD_MULTITHREAD       /* note : limitation : MT single-pass does not support compression with dictionary */
                                 rSize = ZSTDMT_compressCCtx(mtctx,
                                                 blockTable[blockNb].cPtr,  blockTable[blockNb].cRoom,
                                                 blockTable[blockNb].srcPtr,blockTable[blockNb].srcSize,
                                                 cLevel);
-#else
+#  else
                                 rSize = ZSTD_compress_advanced (ctx,
                                                 blockTable[blockNb].cPtr,  blockTable[blockNb].cRoom,
-                                                blockTable[blockNb].srcPtr,blockTable[blockNb].srcSize, NULL, 0, zparams);
-#endif
+                                                blockTable[blockNb].srcPtr,blockTable[blockNb].srcSize,
+                                                NULL, 0, zparams);
+#  endif
                             }
-                            if (ZSTD_isError(rSize)) EXM_THROW(1, "ZSTD_compress_usingCDict() failed : %s", ZSTD_getErrorName(rSize));
+                            if (ZSTD_isError(rSize))
+                                EXM_THROW(1, "ZSTD_compress_usingCDict() failed : %s",
+                                            ZSTD_getErrorName(rSize));
+#endif  /* ZSTD_NEWAPI */
                             blockTable[blockNb].cSize = rSize;
                         }
                         nbLoops++;
@@ -382,7 +411,7 @@
                             }
                             pos = (U32)(u - bacc);
                             bNb = pos / (128 KB);
-                            DISPLAY("(block %u, sub %u, pos %u) \n", segNb, bNb, pos);
+                            DISPLAY("(sample %u, block %u, pos %u) \n", segNb, bNb, pos);
                             if (u>5) {
                                 int n;
                                 for (n=-5; n<0; n++) DISPLAY("%02X ", ((const BYTE*)srcBuffer)[u+n]);