Skip to content

Conversation

@rowlesmr
Copy link
Collaborator

make it be the same as the imgCIF definition
@jamesrhester
Copy link
Contributor

This pull request brought to light something I'd forgotten: DIFFRN_DETECTOR in the mmCIF view of things has key _diffrn_detector.diffrn_id. imgCIF then goes and adds _diffrn_detector.id(I'm ignoring the variant keys as they are unused and (in my opinion) ill-conceived). So, to refer to a detector, we have to point to both _diffrn.id and _diffrn_detector.id, as the same detector id could mean different things for different _diffrn.id.

DIFFRN_RADIATION had the same problem. What we've done with DIFFRN_RADIATION is to risk incompatibilities with mmCIF if some pdbx file were to actually include more than one _diffrn.id, which can't happen, as one block can only be one _diffrn.id and pdbx blocks are always the complete dataset. DIFFRN_DETECTOR feels more important, as it is one of the foundations of 2D raw data presentation.

Some ways forward:

  1. Just keep compatibility by adding an extra data name pointing to _diffrn.id whenever we want to point to _diffrn_detector.id. If we only do it rarely, that's not a great burden.
  2. Use indirection: create a new category DIFFRN_DETECTOR_NEW that only has _diffrn_detector_new.id as key, and non key data names that point to the correct DIFFRN_DETECTOR detector. This would make sense only where we often refer to DIFFRN_DETECTOR, as it means an extra 4 definitions.
  3. Go ahead and drop _diffrn.id as a key to DIFFRN_DETECTOR (and DIFFRN_DETECTOR_AXIS,_ELEMENT) in our dictionaries, leaving it as a non-key data name that only has a value if there is a single _diffrn.id in the data set. All current pdbx or imgCIF/CBF files remain valid against both sets of dictionaries, but incompatibilities arise as soon as the wwPDB or raw data people dream up multi _diffrn.id data sets.

I'm inclined to go with (1), inelegant as it is. We only have one spot where we need to link in with DIFFRN_DETECTOR.

@rowlesmr
Copy link
Collaborator Author

Why does changing the specimen temperature change the detector, or the source, for that matter?
.

Multi _diffrn.id data sets already exist! An in situ dataset with 100 temperature steps is 100 _diffrn.id values; that's 100 sources and 100 detectors, even though nothing about them has changed.

@rowlesmr
Copy link
Collaborator Author

as the same detector id could mean different things for different _diffrn.id.

What is the use-case for that?

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 28, 2025

A test data set. Just trying to reason through different ways of linking/identifying/etc everything.

#\#CIF_2.0

# Assume:
# all identical _audit_dataset.id and appropriate _audit.schema (I'm looping Set to save space)
# _diffrn_detector.id is only key
# _diffrn_radiation.id is only key
# _diffrn_radiation_wavelength.id and _diffrn_radiation_wavelength.radiation_id are only keys
# _pd_instr_detector.diffrn_detector_id is non-key and links to _diffrn_detector.id


data_diffrn_list
    loop_
    _diffrn.id  # key
    _diffrn.ambient_temperature
    _diffrn.ambient_environment
        di_100  100   vacuum
        di_150  150   vacuum
        di_200  200   vacuum

data_source
    _diffrn_radiation.id    moly_tube  # key

    loop_
    _diffrn_radiation_wavelength.id  # key
    _diffrn_radiation_wavelength.wt
    _diffrn_radiation_wavelength.value
    #_diffrn_radiation_wavelength.radiation_id   # elided key
    1   0.6533   0.709300
    2   0.3467   0.713574

data_instr
    _pd_instr.id             Instr_1  # key
    _pd_instr.geometry       fancy
    _pd_instr.radiation_id   moly_tube

    loop_
    _pd_instr_detector.id  # key
    _pd_instr.dist_spec_detc
    _pd_instr_detector.instr_id
    _pd_instr_detector.diffrn_detector_id
    A   175.0   Instr_1   q
    B   175.1   Instr_1   w
    C   174.9   Instr_1   e

data_detectors
    loop_
    _diffrn_detector.id  # key
    _diffrn_detector.description
    _diffrn_detector.dtime
    _diffrn_detector.make
    q   "small area detector"    200   "Fancy Corp"
    w   "small area detector"    300   "Fancy Corp"
    e   "small area detector"    400   "Fancy Corp"

data_raw_q_100
#...

data_raw_w_100
#...

data_raw_e_100
#...

data_raw_q_150
#...

data_raw_w_150
#...

data_raw_e_150
#...

data_raw_q_200
#...

data_raw_w_200
#...

data_raw_e_200
#...

data_diffpat_100
    _pd_diffractogram.id          unk_100  # key
    _pd_diffractogram.instr_id    Instr_1
    _pd_diffractogram.diffrn_id   di_100

    # _pd_proc_overall.diffractogram_id  # elided key
    _pd_proc.info_data_reduction
    "Fancy things to turn multiple 2D detectors into a single 1D pattern"

    loop_
    _pd_data.point_id  # key
    _pd_proc.2theta_corrected
    _pd_proc.intensity_total
    #_pd_data.diffractogram_id  # elided key
    1       3.00   1234
    # ...
    2700   29.99   1352

data_diffpat_150
    _pd_diffractogram.id          unk_150
    _pd_diffractogram.instr_id    Instr_1
    _pd_diffractogram.diffrn_id   di_150

    _pd_proc.info_data_reduction
    "Fancy things to turn multiple 2D detectors into a single 1D pattern"

    loop_
    _pd_data.point_id
    _pd_proc.2theta_corrected
    _pd_proc.intensity_total
    1       3.00   2345
    # ...
    2700   29.99   2456

data_diffpat_200
    _pd_diffractogram.id          unk_200
    _pd_diffractogram.instr_id    Instr_1
    _pd_diffractogram.diffrn_id   di_200

    _pd_proc.info_data_reduction
    "Fancy things to turn multiple 2D detectors into a single 1D pattern"

    loop_
    _pd_data.point_id
    _pd_proc.2theta_corrected
    _pd_proc.intensity_total
    1       3.00   3456
    # ...
    2700   29.99   3456

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 28, 2025

An observation: mmCIF says that _diffrn.id "uniquely identifies a diffraction data set". That is obviously not true; I can collect two datasets at the same temperature/pressure/environment, and due to myriad reasons, have different results.

But looking at your suggestions:

  1. Just keep compatibility by adding an extra data name pointing to _diffrn.id whenever we want to point to _diffrn_detector.id. If we only do it rarely, that's not a great burden.

For DIFFRN_DETECTOR, we'd need to add it to PD_INSTR_DETECTOR, and for DIFFRN_RADIATION, we'd need to add it to PD_INSTR. In both cases, doesn't that mean we'd need a new set of detectors and instruments for each _diffrn.id? as well as a new source and new detector?

  1. Use indirection: create a new category DIFFRN_DETECTOR_NEW that only has _diffrn_detector_new.id as key, and non key data names that point to the correct DIFFRN_DETECTOR detector. This would make sense only where we often refer to DIFFRN_DETECTOR, as it means an extra 4 definitions.

I can't see how you can link a single unique _diffrn_detector_new.id key to multiple _diffrn_detector_new.diffrn_detector_id and _diffrn_detector_new.diffrn_id non-keys to allow lookup of a detector. If I'm just being silly, and can't see it, then this is good gets rid of the necessity to create multiple sets of PD detectors, and just requires making a single lookup table. Maybe _diffrn_detector_new.diffrn_id could be a list? Can we do drel magic here?

  1. Go ahead and drop _diffrn.id as a key to DIFFRN_DETECTOR (and DIFFRN_DETECTOR_AXIS,_ELEMENT) in our dictionaries, leaving it as a non-key data name that only has a value if there is a single _diffrn.id in the data set. All current pdbx or imgCIF/CBF files remain valid against both sets of dictionaries, but incompatibilities arise as soon as the wwPDB or raw data people dream up multi _diffrn.id data sets.

I wouldn't leave it as a non-key data name, as I think it would confuse people; "What happens in the multi _diffrn.id data set case?". I don't suppose we can get mmCIF and imgCIF to change key from *.diffrn_id to *.id? I mean, id already exists.

@rowlesmr
Copy link
Collaborator Author

Also, in mmCIF, there is only one key, _dd.diffrn_id. There is also non-key dataname _dd.id, which "must uniquely identify each detector used to collect each diffraction data set.", but how can you reference "each" detector (implying more than one), if there is only one key?

@jamesrhester
Copy link
Contributor

Why does changing the specimen temperature change the detector, or the source, for that matter?

Well, in a properly-specified world, it doesn't. I speculate that the sequence of events went as follows:

  1. Core CIF dictionary (1993). Not trying to be a relational database but accidentally creating one. Detectors are part of the setup, so they get put in a category called diffrn_detector, where 'Category' is just an organisational convenience for non-looped data.
  2. mmCIF dictionary (1998 ish). Everything is relational, all categories are looped. Therefore all categories have a key. These sorts of one-row categories got given _diffrn.id pointers cause they had to have something and it kept them in the 'setup' area as well as restricting them to single rows.
  3. imgCIF dictionary (a bit later). There could be multiple detectors in a single experiment, better create _diffrn_detector.id and make it a key.

Multi _diffrn.id data sets already exist! An in situ dataset with 100 temperature steps is 100 _diffrn.id values; that's 100 sources and 100 detectors, even though nothing about them has changed.

Well exactly. Which would be our argument if we wanted to convince the mmCIF people to drop _diffrn.id from these categories.

as the same detector id could mean different things for different _diffrn.id.

What is the use-case for that?

None whatsoever, I'd guess. It would be pretty obnoxious to switch around or generate new detector ids for different _diffrn.ids. I'm just saying that this is what the current mmCIF definitions permit, and therefore properly-written software should always check both _diffrn.id and _diffrn_detector.id when following detector links.

Also, in mmCIF, there is only one key, _dd.diffrn_id. There is also non-key dataname _dd.id, which "must uniquely identify > each detector used to collect each diffraction data set.", but how can you reference "each" detector (implying more than > one), if there is only one key?

Because _dd.id was defined to be a key in imgCIF. I'm not sure that the wwPDB would be able to handle a submission that included a full-blown imgCIF detector description (it appears _dd.id is used in 0.001% of wwPDB entries, which would be about 3). imgCIF data names are used in the (relatively few) imgCBF single-frame files that are archived as raw data (not at the wwPDB).

@jamesrhester
Copy link
Contributor

An observation: mmCIF says that _diffrn.id "uniquely identifies a diffraction data set". That is obviously not true; I can collect two datasets at the same temperature/pressure/environment, and due to myriad reasons, have different results.

Yes, what _diffrn.id refers to is defined by what data names belong to categories that it's a key of. The definition for the DIFFRN category itself is less all-encompassing.

But looking at your suggestions:

  1. Just keep compatibility by adding an extra data name pointing to _diffrn.id whenever we want to point to _diffrn_detector.id. If we only do it rarely, that's not a great burden.

For DIFFRN_DETECTOR, we'd need to add it to PD_INSTR_DETECTOR, and for DIFFRN_RADIATION, we'd need to add it to PD_INSTR. In both cases, doesn't that mean we'd need a new set of detectors and instruments for each _diffrn.id? as well as a new source and new detector?

Just thinking this through: our 2D raw data for each measurement would be accessed via _diffrn.id + _diffrn_detector.id in any case (because obviously the data on each detector are different for different _diffrn.id). To find the raw data for a given diffractogram, we combine the diffractogram _pd_diffractogram.diffrn_id with the instr -> instr_detector -> (_pd_instr_detector.detector_id, _pd_instr_detector.diffrn_id) tuple to look up the data, with no requirement that the _pid.diffrn_id is the current _diffrn.id as it is not a key. So _pid.diffrn_id can just be e.g. the first _diffrn.id and doesn't need to be restated for every measurement. The data producer controls the naming of detectors so can ensure this is the case.

Where things blow up is in the 2D data presentation (out of the control of pdCIF). All information in categories starting with DIFFRN will need to be repeated for every _diffrn.id. This repetition is not that surprising for imgCIF - see an example here. 100 _diffrn.ids compares favourably to 3600 image frames.

  1. Use indirection: create a new category DIFFRN_DETECTOR_NEW that only has _diffrn_detector_new.id as key, and non key data names that point to the correct DIFFRN_DETECTOR detector. This would make sense only where we often refer to DIFFRN_DETECTOR, as it means an extra 4 definitions.

I can't see how you can link a single unique _diffrn_detector_new.id key to multiple _diffrn_detector_new.diffrn_detector_id and _diffrn_detector_new.diffrn_id non-keys to allow lookup of a detector. If I'm just being silly, and can't see it, then this is good gets rid of the necessity to create multiple sets of PD detectors, and just requires making a single lookup table. Maybe _diffrn_detector_new.diffrn_id could be a list? Can we do drel magic here?

It's just the same as the previous one. pdCIF items just refer to DET1 always. It's like the diffrn_detector_new category is a way of formally guaranteeing that the referenced detector does not change, instead of example 1 above where the data producer is expected to follow that guideline (which can be checked, of course).

_diffrn_detector_new.id DET1
_diffrn_detector_new.diffrn_id 100C
_diffrn_detector_new.detector_id OLD_DET1

loop_
_diffrn_detector.diffrn_id
_diffrn_detector.detector_id
100C   OLD_DET1
200C   OLD_DET1 #unused
300C   OLD_DET1 #unused
400C   OLD_DET1 #unused

_pd_instr_detector.detector_id DET1
 
  1. Go ahead and drop _diffrn.id as a key to DIFFRN_DETECTOR (and DIFFRN_DETECTOR_AXIS,_ELEMENT) in our dictionaries, leaving it as a non-key data name that only has a value if there is a single _diffrn.id in the data set. All current pdbx or imgCIF/CBF files remain valid against both sets of dictionaries, but incompatibilities arise as soon as the wwPDB or raw data people dream up multi _diffrn.id data sets.

I wouldn't leave it as a non-key data name, as I think it would confuse people; "What happens in the multi _diffrn.id data set case?". I don't suppose we can get mmCIF and imgCIF to change key from *.diffrn_id to *.id? I mean, id already exists.

We could deprecate dd.diffrn_id but we can't make data names go away. It would be quite a lot of discussion, not necessarily with the outcome we want, to try to improve mmCIF like this. On the bright side, all known data files would be equally valid if _diffrn.id is not a key as there is only ever a single _diffrn.id in a PDB deposition (they don't formally deal with multi-block files AFAIK).

If we do want to broach this, it won't be resolved before VolG is published, pretty sure. We'd be best placed for a future where the wwPDB + imgCIF do accept dropping _diffrn.id as a key if we go with option 1 - in that case we can just deprecate + ignore _pid.diffrn_id, whereas option 2 bakes all those new data names into software essentially forever.

@rowlesmr
Copy link
Collaborator Author

I want to be future-proofed. So I'll write up option 1 and see how it works.

My idea wouldn't deprecate _dd.diffrn_id per se, just make it non-key, and promote _dd.id to key.

@rowlesmr
Copy link
Collaborator Author

Just thinking this through: our 2D raw data for each measurement would be accessed via _diffrn.id + _diffrn_detector.id in any case (because obviously the data on each detector are different for different _diffrn.id). To find the raw data for a given diffractogram, we combine the diffractogram _pd_diffractogram.diffrn_id with the instr -> instr_detector -> (_pd_instr_detector.detector_id, _pd_instr_detector.diffrn_id) tuple to look up the data, with no requirement that the _pid.diffrn_id is the current _diffrn.id as it is not a key. So _pid.diffrn_id can just be e.g. the first _diffrn.id and doesn't need to be restated for every measurement. The data producer controls the naming of detectors so can ensure this is the case.

How does it handle multiple datasets with the same _diffrn.id?

I know I've done this in the past: Collect 10 diffractograms at a constant temperature, ramp up one step, collect 10 diffractograms, and so on. The idea was to look at the kinetics of the changes at each temperature, as well as what changes occur with temperature.

@jamesrhester
Copy link
Contributor

How does it handle multiple datasets with the same _diffrn.id?

I know I've done this in the past: Collect 10 diffractograms at a constant temperature, ramp up one step, collect 10 diffractograms, and so on. The idea was to look at the kinetics of the changes at each temperature, as well as what changes occur with temperature.

imgCIF has the concept of "scans" and "frames". A scan is a series of measurements ('frames'). There can be multiple scans for each _diffrn.id. In the single crystal context, this is like step-rotating the crystal while measuring, then changing the orientation and rotating again. Should be overkill for our case. It does mean that _pd_diffractogram will want to point to _diffrn_scan.scan_id and _diffrn_scan_frame.frame_id if it is derived from a 2D measurement and there is more than one scan + frame.

@jamesrhester
Copy link
Contributor

I want to be future-proofed. So I'll write up option 1 and see how it works.

My idea wouldn't deprecate _dd.diffrn_id per se, just make it non-key, and promote _dd.id to key.

_dd.id is a key according to imgCIF (not mmCIF). We definitely want it to be a key and if it's good enough for imgCIF it's good enough for us.

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 30, 2025

So I'll write up option 1

  1. Just keep compatibility by adding an extra data name pointing to _diffrn.id whenever we want to point to _diffrn_detector.id. If we only do it rarely, that's not a great burden.

So, in multiblock, there will be DIFFRN_DETECTOR and DIFFRN_RADIATION. Both will have .id as key and .diffrn_id as non key data names.

There will exist pd_instr.radiation_id linked to _dr.id, and _pd_instr_detector.diffrn_detector_id linked to _dd.id. Will also need pd_instr.diffrn_id and _pd_instr_detector.diffrn_id linked to _diffrn.id

For completeness, PD_DIFFRACTOGRAM should also have _pd.diffrn_scan_id and _pd.diffrn_scan_frame_frame_id linked to _diffrn_scan.id and _diffrn_scan_frame.frame_id from imgCIF (or should that last be linked to _diffrn_data_frame.id, given that _dsf.frame_id is linked to that...). If we do this last paragraph, we will also need to import the imgCIF dictionary.

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jun 30, 2025

I think there will need to be an example of use, or at least an expanded description on how the *.diffrn_id values are to be used.

Where I've added them to _pd_instr.diffrn_id and _pd_instr.diffrn_id, I've added an additional description text:

Included to maintain compatibility with imgCIF. Do not use this value.
Use the value associated with _pd_diffractogram.diffrn_id in any
lookup associated with DIFFRN_DETECTOR|RADIATION.

@rowlesmr
Copy link
Collaborator Author

For completeness, PD_DIFFRACTOGRAM should also have _pd.diffrn_scan_id and _pd.diffrn_scan_frame_frame_id linked to _diffrn_scan.id and _diffrn_scan_frame.frame_id from imgCIF (or should that last be linked to _diffrn_data_frame.id, given that _dsf.frame_id is linked to that...). If we do this last paragraph, we will also need to import the imgCIF dictionary.

or can we just define these in pdCIF/Multiblock without reference to imgCIF?

@jamesrhester
Copy link
Contributor

For completeness, PD_DIFFRACTOGRAM should also have _pd.diffrn_scan_id and _pd.diffrn_scan_frame_frame_id linked to _diffrn_scan.id and _diffrn_scan_frame.frame_id from imgCIF (or should that last be linked to _diffrn_data_frame.id, given that _dsf.frame_id is linked to that...). If we do this last paragraph, we will also need to import the imgCIF dictionary.

or can we just define these in pdCIF/Multiblock without reference to imgCIF?

We would be redefining entire categories, as DIFFRN_SCAN is a category, then DIFFRN_SCAN_FRAME, and then a few others as well. So we should just add an import of imgCIF to the top of the dictionary, making sure that it comes before the multiblock dictionary so that the multiblock dictionary determines the key data names.

@rowlesmr
Copy link
Collaborator Author

rowlesmr commented Jul 1, 2025

We would be redefining entire categories, as DIFFRN_SCAN is a category, then DIFFRN_SCAN_FRAME, and then a few others as well. So we should just add an import of imgCIF to the top of the dictionary, making sure that it comes before the multiblock dictionary so that the multiblock dictionary determines the key data names.

Will PR that over in PD

@jamesrhester jamesrhester merged commit 7f298e6 into COMCIFS:main Jul 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants