Role of whole-body MRI for treatment response assessment in multiple myeloma: comparison between clinical response and imaging response

Background Whole-body MRI (WB-MRI) including diffusion-weighted image (DWI) have been widely used in patients with multiple myeloma. However, evidence for the value of WB-MRI in the evaluation of treatment response remains sparse. Therefore, we evaluated the role of WB-MRI in the response assessment. Methods In our WB-MRI registry, we searched multiple myeloma patients treated with chemotherapy who underwent both baseline and follow-up WB-MRI scans. Clinical responses were categorized as complete response (CR), partial response (PR), stable disease (SD), or progressive disease (PD), using IMWG criteria. Using RECIST 1.1, MD Anderson (MDA) criteria, and MDA-DWI criteria, imaging responses on WB-MRI were rated as CR, PR, SD, or PD by two radiologists independently. Then, discrepancy cases were resolved by consensus. Weighted Kappa analysis was performed to evaluate agreement between the imaging and clinical responses. The diagnostic accuracy of image responses in the evaluation of clinical CR, objective response (CR and PR), and PD was calculated. Results Forty-two eligible patients were included. There was moderate agreement between imaging and clinical responses (κ = 0.54 for RECIST 1.1, κ = 0.58 for MDA criteria, κ = 0.69 for MDA-DWI criteria). WB-MRI showed excellent diagnostic accuracy in assessment of clinical PD (sensitivity 88.9%, specificity 94.7%, positive predictive value [PPV] 84.2%, negative predictive value [NPV] 96.4% in all three imaging criteria). By contrast, WB-MRI showed low accuracy in assessment of clinical CR (sensitivity 4.5%, specificity 98.1%, PPV 50.0%, NPV 71.2% in all three imaging criteria). As to the clinical objective response, the diagnostic accuracy was higher in MDA-DWI criteria than RECIST 1.1 and MDA criteria (sensitivity/specificity/PPV/NPV, 84.2%/94.4%/98.0%/65.4, 54.4%/100%/100%/40.9, and 61.4%/94.4%/97.2%/43.6%, respectively). Conclusions In the imaging response assessment of multiple myeloma, WB-MRI showed excellent performance in the evaluation of PD, but not in the assessment of CR or objective response. When adding DWI to imaging response criteria, diagnostic accuracy for objective response was improved and agreement between imaging and clinical responses was increased.


Background
Recent advances have changed the role of imaging modalities in the diagnosis and staging of multiple myeloma (MM) [1][2][3][4][5][6][7][8][9]. Whole-body magnetic resonance imaging (WB-MRI) is of particular note because of its excellent soft tissue contrast, which allows evaluation of the bone marrow space, and has improved the prognosis of MM by facilitating the early detection of bone marrow lesions [10][11][12][13][14][15][16]. However, the use of WB-MRI as a response assessment tool is still the subject of debate. Despite its high diagnostic performance, WB-MRI is subject to the limitation that it may show persistent non-viable lesions after treatment [5,17,18].
In 2015, the International Myeloma Working Group (IMWG) made a consensus statement on the role of WB-MRI for MM [5]. In terms of response assessment, it states "MRI might help in the better definition of CR," and also "the systemic performance of MRI for the follow-up of patients in the absence of clinical indications is not recommended" [5]. However, the consensus statement itself was evidence level D (panel consensus). Indeed, there have been only a few small studies evaluating the diagnostic performance of WB-MRI in response assessment [19][20][21][22][23], and, surprisingly, no study has evaluated the role of WB-MRI in differentiating progressive disease (PD) from non-PD, which is one of the most important issues in clinical oncology.
The recently updated NCCN guideline version 3.2019 also used the rather vague term "when clinically indicated" to describe the indication for WB-MRI during follow-up after primary therapy [24]. The phrase "clinically indicated" might mean PD on clinical assessment. However, evidence for the exact relationship between radiological response and clinical response is also sparse, indicating the need for further study to clarify the role of WB-MRI.
Furthermore, in respect to multiple myeloma, there remains another critical issue with imaging criteria that has not yet been fully addressed. In terms of anatomic imaging response criteria, RECIST 1.1 is currently regarded as the standard assessment tool [25]. However, RECIST 1.1 only focuses on solid tumor, which raises an issue as to whether it is relevant or not for use in MM, which is a hematologic malignancy that mainly presents with lytic lesions in bone marrow [11,26,27]. Thus, the MD Anderson (MDA) criteria have been proposed for response assessment in bone metastases, including in multiple myeloma [28]. The role of RECIST 1.1 and MDA in the assessment of treatment response in multiple myeloma has not been evaluated yet. Regarding functional imaging, recent studies have proposed the role diffusion-weighted image in response evaluation and suggested its use in clinical practice [2,16,17,29]. Several studies have shown that increase in ADC value during treatment is correlated with good response [21,29,30]. However, no study has elucidated the added benefit of DWI by comparing anatomic imaging response criteria with DWI and without DWI.
To this end, we performed this study to evaluate the agreements and discrepancies between imaging response and clinical response in MM. In addition, we aimed to evaluate the diagnostic performance of WB-MRI for assessing treatment response, with the clinical response being set as the reference standard. For the imaging response

Methods
This retrospective study was approved by our institutional review board, and the requirement for written informed consent was waived. The STARD 2015 guideline for reporting diagnostic performance was followed.

Patients
Our institution started to perform WB-MRI for multiple myeloma patients in March 2015 and established a WB-MRI registry. This registry was searched for eligible patients according to the inclusion and exclusion criteria. The inclusion criteria were as follows: [1] patients who were diagnosed with multiple myeloma between March 2015 and March 2018 at our institution [2]; patients who underwent systemic chemotherapy with or without autologous stem cell transplantation (ASCT); and [3] patients who underwent at least two consecutive WB-MRI, one at baseline and one during follow-up. The exclusion criteria included: [1] patients with no evaluable disease according to the IMWG criteria, and [2] the presence of other known malignancy.

Image acquisition
The WB-MRI scans evaluated in this study were performed on the same 3-T unit (Skyra, Siemens Healthineers), using a 16-channel head and neck coil, a 64channel body coil, and four body surface coils. A multiplestation acquisition was implemented to cover the full body (vertex to feet). The full scanning sequences and their parameters are summarized in Table 1.

Medical information review
The patients' medical information was reviewed from their medical records, including treatment history, clinical stability, and laboratory test results. Medical information for the evaluation of clinical response was recorded in all patients at each MR visit. The time interval between recording of the medical information and the WB-MRI scanning was less than 2 weeks in all cases. The laboratory test results included plasma cell percentage on bone marrow biopsy, serum and urine Mprotein, serum-free light chain (FLC) ratio, absolute FLC difference, beta-2 microglobulin, albumin, calcium, creatinine, and hemoglobin level.

Clinical response assessment
A clinical researcher (K.L., general physician with 1 year of experience) evaluated the clinical responses of the patients at each MR visit using the IMWG criteria. The IMWG criteria are based on changes in M-protein as a major indicator for response assessment. The response categories include complete response (CR), very good partial response (VGPR), partial response (PR), minimal response (MR), stable disease (SD), and progressive disease (PD) [1]. Detailed explanations of each response category are provided in Additional file 1: Table S1. To analyze the associations between clinical and imaging responses, the clinical responses were simplified to seroCR, seroPR (VGPR+PR + MR), ser-oSD, and seroPD, where the prefix 'sero' indicates IMWG criteria based solely on laboratory data, excluding any imaging components.

Imaging response assessment
Five MR imaging patterns of marrow involvement have been described in multiple myeloma [3]: [1] normal appearance of the bone marrow despite minor microscopic plasma cell infiltration, [2] focal involvement, [3] homogeneous diffuse infiltration, [4] combined diffuse and focal infiltration, and [5] 'salt and pepper' pattern with inhomogeneous bone marrow with interposition of fat islands.
Two radiologists (K.W.K., 13 years of experience; H.Y.P., 4 years of experience) independently evaluated the imaging response at each MR visit using the RECIST 1.1, MDA criteria, and MDA-DWI criteria, without knowledge of the clinical response. A web-based electronic case report form (eCRF) was used for the independent blind reviews. If a discrepancy was detected between the two radiologists, a consensus was made through image review. The discrepancy rate was calculated and the reasons for the discrepancy were recorded in the eCRF form.
While performing the imaging response assessment, focal involvement of bone marrow was considered as the target lesion in most cases because it is easy to measure the size of such lesions. Other bone marrow involvement patterns such as diffuse or 'salt and pepper' patterns were regarded as non-target lesions.

RECIST 1.1
According to RECIST 1.1, a target lesion for soft tissue is a well-defined lesion ≥10 mm in the longest axis, while for lymph node it is ≥15 mm in the shortest axis [25]. The largest sum of diameters (SoD) of five target lesions is evaluated, with a maximum of two lesions per organ [25]. RECIST 1.1 contains the four response categories of CR, PR, PD, and SD. The detailed response evaluations are summarized in Additional file 1: Table S2. In this study, the two maximal focal bone marrow lesions exceeding 10 mm in size were counted as target lesions. Extramedullary soft tissue myelomas, if present, were measured in an identical manner. Non-target lesions were also evaluated at every MR visit.

MD Anderson criteria
The MDA criteria are bone-specific response criteria developed by The University of Texas MD Anderson Cancer Center in 2004 for evaluation of bone metastases [28,31]. They include objective measurement of the perpendicular diameters of measurable lesions and subjective evaluation of unmeasurable lesions [28]. The MDA criteria divides responses into the four categories of CR, PR, PD, and SD [28]. We adjusted the MDA criteria for response evaluation in multiple myeloma, choosing the two maximal target lesions in the bone marrow, and regarded all remaining bone marrow lesions as non-target lesions. If there were discrepancies between the responses of target and non-target lesions, the responses of the lesions that represented the bulk of the disease were followed. The detailed response criteria are described in Additional file 1: Table S3.

MDA-DWI criteria
To create imaging response assessment criteria based on both anatomic and functional imaging criteria, we modified MDA criteria by adding DWI criteria (MDA-DWI). Details of the DWI criteria were adopted from a recently proposed structured reporting tool for multiple myeloma, Myeloma Response Assessment and Diagnosis System (MY-RADS) [16].
Modified statements of the DWI criteria are presented in Additional file 1: Table S3. To obtain ADC value, ROI was drawn manually on high b-value image for maximal two target lesions which were selected using MDA criteria. In patient with diffuse involvement pattern, ROI was drawn in the bone region with signal alteration (i.e., vertebral body or pelvic bone). Mean ADC values of ROIs were calculated and compared between MR visits.

Statistical analysis
The primary outcome was the agreement between the clinical response and imaging responses, which was evaluated using weighted Kappa. The secondary outcomes included the reliability of the imaging response assessment with regard to inter-reader agreement and intercriteria agreement, evaluated using weighted Kappa. The diagnostic accuracies of WB-MRI (sensitivity, specificity, positive predictive value, negative predictive value) in the evaluation of clinical CR, objective response (CR and PR), and PD were calculated. SPSS (version 23, IBM) and Medcalc software (version 14.8.1, Medcalc, Mariakerke, Belgium) were used for the analysis, with a significance level of p = 0.05 indicating statistical significance.  excluded because they underwent only a single MR examination, either at baseline or during follow-up. We also excluded four patients who underwent baseline WB-MRI at outside hospitals, and another eight patients with either no evaluable disease according to IWMG criteria or WB-MRI, or the presence of other known malignancies. Finally, 42 consecutive patients who met the inclusion and exclusion criteria were enrolled. The patient demographics are shown in Table 2. The majority of the patients (n = 38; 90.4%) were treated with bortezomib-based systemic chemotherapy, either with VTD (bortezomib + thalidomide + dexamethasone, n = 28) or VMP (bortezomib + melphalan + prednisolone, n = 10). The rest of the patients were treated with TD (thalidomide + dexamethasone, n = 3) or DCEP (dexamethasone + cyclophosphamide + etoposide + cisplatin, n = 1). The distributions of the imaging patterns were as follows: focal (n = 25; 59.5%), diffuse (n = 5; 11.9%), focal and diffuse (n = 5; 11.9%), and salt and pepper pattern (n = 7; 16.7%).

Agreements between imaging and clinical responses
Agreement between imaging and clinical responses was moderate when using RECIST 1.1 and MDA (κ = 0.54 and 0.58, respectively) and increased when using MDA-DWI (κ = 0.69). Cross-tabulations between the imaging and clinical responses are shown in Table 3. Notably, of the 20 cases with imaging SD on MDA criteria, 13 cases were changed to PR on MDA-DWI criteria.
In patients with seroCR and seroPR, the anatomic imaging criteria (RECIST 1.1 and MDA) showed a tendency to underestimate the response in comparison with the clinical response. Indeed, 95.4% (21 out of 22) of patients with a seroCR and 45.7% (16 out of 35) of patients with a seroPR were underestimated in the imaging response using both RECIST 1.1 and MDA criteria. When using MDA-DWI criteria, the percentage of underestimated cases was decreased in patients with ser-oPR (20%, 7 out of 35), while that in patients with ser-oCR was unchanged (95.4%, 21 out of 22). In patients with seroPD, agreement with imaging PD was higher than patients with sero CR/PR. In all three criteria, four patients showed clinically significant discrepancies between clinical and imaging responses which could impact the treatment plan. A representative case is shown in Fig. 2. Brief histories of these four patients are summarized in Additional file 1: Table S4. Table 4 summarizes the performance of WB-MRI in the evaluation of treatment response in MM, with the clinical responses as the reference standard. WB-MRI showed high diagnostic accuracy in the assessment of clinical PD, with a sensitivity of 88.9%, specificity of 94.7%, positive predictive value (PPV) of 84.2%, and negative predictive value (NPV) of 96.4% in all three imaging criteria (Fig. 3). By contrast, WB-MRI showed low accuracy in assessment of clinical CR (sensitivity 4.5%, specificity 98.1%, PPV 50.0%, NPV 71.2% in all three imaging criteria). Regarding the clinical objective response, the diagnostic accuracy was increased in MDA-DWI criteria compared to RECIST 1.1 and MDA criteria (sensitivity 54.4%, specificity 100%, PPV 100%, NPV 40.9% for RECIST 1.1; sensitivity 61.4%, specificity 94.4%, PPV 97.2%, NPV 43.6% for MDA criteria; sensitivity 84.2%, specificity 94.4%, PPV 98.0%, NPV 65.4% for MDA-DWI criteria) (Fig. 4).

Reliability of imaging response criteria
Analysis of inter-reader agreement showed strong agreement between the two readers in all three imaging criteria (κ = 0.82 for RECIST 1.1; κ = 0.82 for MDA criteria, κ = 0.85 for MDA-DWI criteria). The inter-reader discrepancy rates were 20.0% with RECIST 1.1, 18.7% with the MDA criteria, and 13.3% with MDA-DWI criteria. All discrepant cases were due to either different target lesion selections or discrepancy in the subjective evaluation of the tumor burden of non-target lesions.
In terms of the inter-criteria agreement, strong agreement was noted in all three imaging criteria (κ = 0.86 between the RECIST 1.1 and MDA criteria; κ = 0.85 between RECIST 1.1 and MDA-DWI criteria; κ = 0.89 between MDA criteria and MDA-DWI criteria). The discrepancy rate between RECIST 1.1 and MDA criteria was 20.0%, with the discrepancies being mainly due to non-target responses: the MDA criteria can differentiate PR and SD for non-target lesions, while RECIST 1.1 regards PR and SD as non-CR/non-PD as a whole. The discrepancy rate between MDA and MDA-DWI criteria was 17.3% due to the persistent bone marrow lesions despite decreased diffusion restriction.

Discussion
This study investigated the relationship between imaging response and clinical response and the diagnostic accuracy of WB-MRI as a response assessment tool in MM. We demonstrated moderate agreement between clinical and anatomic imaging criteria (κ = 0.54 for RECIST 1.1, κ = 0.58 for MDA criteria), with WB-MRI having a tendency to underestimate CR or PR compared with the clinical responses. When diffusion weight image was combined, agreement between clinical and imaging response was improved (κ = 0.69 for MDA-DWI criteria). Our results are in general agreement with the 2015 IMWG consensus statement and several previous studies, in that WB-MRI may show persistent non-viable lesions in bone marrow after treatment [5,17,19,32].
Compared with previous studies, this study showed WB-MRI to have low performance in the diagnosis of CR or PR. Indeed, on WB-MRI, bone marrow signal abnormalities were documented as being resolved in only one out of 22 scans that were regarded as clinical CR; we were able to detect residual lesions in bone marrow in the other 21 scans. Most of the lesions decreased in size after treatment, but a few showed no size change. In the majority of cases, the internal signal characteristics and enhancement pattern altered after treatment, with the T2 signal intensity and ADC value of the lesions being increased at follow-up and the enhancement pattern changing from solid to peripheral enhancement.
Our study showed excellent diagnostic performance of WB-MRI in the evaluation of clinical PD (sensitivity, 88.9%; specificity, 94.7% for all three imaging criteria), but poor diagnostic performance in the assessment of clinical CR (sensitivity, 4.5%; specificity, 98.1% for all three imaging criteria). This finding supports the consensus statement made by the IMWG that "routine WB-MRI is not recommended for the evaluation of treatment response". However, when adding DWI to anatomic WB-MRI, the diagnostic accuracy of clinical objective response (CR + PR) was compared to those of RECIST 1.1 and MDA criteria. Indeed, of the 20 cases with imaging SD on MDA criteria, 13 cases were changed to PR on MDA-DWI criteria, suggesting benefit of functional imaging when assessing persistent bone marrow lesions on anatomic imaging. These findings are in concordance with previous studies that WB-DWI can be a potential tool in treatment response evaluation of bone marrow lesions of multiple myeloma [2,16,[33][34][35].
Four patients showed notable discrepancies between imaging and clinical responses. Two patients were evaluated as PD on imaging despite a clinical PR or CR. The other two patients were clinical PD, despite showing a PR or SD on imaging. Interestingly, in one patient with PD on imaging, the laboratory markers including M-protein, FLC ratio, and absolute FLC difference continued to fluctuate, while the imaging features and the patient's bone pains worsened. The patient was managed according to the imaging response, and the chemotherapy regimen was changed. This case suggests that WB-MRI can be used to confirm PD when there is ambiguity in the clinical response or clinical suspicion of disease progression. This result is in concordance with the NCCN guideline and IMWG consensus statement that WB-MRI should be used when clinically indicated.
In our study, strong agreement was noted for interreader (κ = 0.82-0.85) and inter-criteria agreement (κ = 0.85-0.89). Importantly, there were no discrepancies in the evaluation of PD. Those inter-reader discrepancy cases found were due to different target lesion selections or discrepancy in the subjective evaluation of the tumor burden of non-target lesions. The discrepancy cases between RECIST 1.1 and MDA criteria were mainly due to non-target responses; the MDA criteria can differentiate PR and SD for non-target lesions, while RECIST 1.1 regards PR and SD as non-CR/non-PD as a whole. The discrepancy cases between MDA and MDA-DWI criteria were due to the persistent bone marrow lesions with variable diffusion restriction. However, as all the discrepancy cases were PR or SD, inter-reader and inter-criteria discrepancies did not impact on clinical decisions. Our study suggests that MDA-DWI criteria might be better in treatment response assessment in multiple myeloma than anatomic imaging criteria alone.
Our study has several limitations. First, this was a retrospective study, and therefore the patients showed variable intervals between MR exams. This may have impaired the strength of our results. A further prospective study design with regular MR intervals should be performed to acquire more solid data. Second, we used the RECIST 1.1, MDA criteria, and MDA-DWI criteria to evaluate imaging response, but neither of these are wellvalidated criteria for multiple myeloma. As the MDA criteria is bone-specific, we needed to modify the criteria to incorporate evaluation of extramedullary myeloma involvement. Further study is needed to validate th reporting system and to evaluate its clinical impact on multiple myeloma.

Conclusion
Imaging response assessment using WB-MRI showed excellent performance in the evaluation of disease progression, but not in the assessment of complete response or objective response. When adding DWI to imaging response criteria, diagnostic accuracy for objective response was improved and agreement between imaging and clinical responses was increased. Fig. 4 A 56-year-old male with multiple myeloma demonstrating additional benefit of DWI in the evaluation of clinical objective response. MRI taken before chemotherapy (a) shows a 4.0 × 3.1 cm enhancing mass at posterior arc of left 9th rib. Axial DWI (b = 900) and ADC map show marked diffusion restriction in the lesion (mean ADC value: 0.54 × 10 − 3 mm 2 /s). Coronal diffusion MIP image (b = 900) shows diffuse high signal intensity involving whole axial skeletons, suggesting diffuse bone marrow involvement. After 4 cycles of chemotherapy, follow-up MRI (b) shows equivocal change in the tumor size (3.7 × 2.3 cm) in the left 9th rib. Axial DWI (b = 900) and ADC map show marked improvement of diffusion restriction in the lesion (mean ADC value: 1.54 × 10 − 3 mm 2 /s). Coronal diffusion MIP image (b = 900) demonstrates decreased signal intensity in the axial skeletons, suggesting good response to the treatment. Based on RECIST 1.1 or MDA criteria, this patient is classified as imaging SD. However, when DWI findings are considered in MDA-DWI criteria, this patient is classified into imaging PR