assisted installer service getting error - chronyc: error while loading shared libraries: libnettle.so.8 #385

Open
pdfruth opened this issue Jun 26, 2022 · 8 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@pdfruth

pdfruth commented Jun 26, 2022

I'm using the self-hosted assisted installer service to install Single Node OKD.
The assisted installer service is running in podman containers, as documented here

This method of installing single-node OKD used to work, but it has started to fail recently (within the last 30 days or so).

The host registers with the installer service, but gets stuck on an NTP synchronization failure, as seen in the attached screenshot.

[Screenshot: Screen Shot 2022-06-25 at 5 41 06 PM]

Looking into the pod logs of the assisted installer service, I see this message:

level=error msg="Received step reply <ntp-synchronizer-392f0f02> from infra-env <ff4ce4b9-a3cd-4c50-b258-24cfbba8d1e3> host <68b15b04-5cb1-429f-9778-3c8727d0235d> exit-code <-1> stderr <chronyc exited with non-zero exit code 127: \nchronyc: error while loading shared libraries: libnettle.so.8: cannot open shared object file: No such file or directory\n> stdout <>" func=github.com/openshift/assisted-service/internal/bminventory.logReplyReceived file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:2992" go-id=9762 host_id=68b15b04-5cb1-429f-9778-3c8727d0235d infra_env_id=ff4ce4b9-a3cd-4c50-b258-24cfbba8d1e3 pkg=Inventory request_id=6a4edac8-f290-4cb2-813e-f6a67ef9c50b

The relevant part of the message is: chronyc: error while loading shared libraries: libnettle.so.8: cannot open shared object file: No such file or directory
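For anyone triaging similar logs, here is a minimal sketch of pulling the missing library name out of such a line. The sample string is a shortened copy of the log line quoted above; the function name and regex are illustrative assumptions, not part of the assisted-service code.

```python
import re
from typing import Optional

# Shortened copy of the log line quoted above (IDs and trailing fields omitted).
LOG_LINE = (
    'level=error msg="Received step reply <ntp-synchronizer-392f0f02> '
    'exit-code <-1> stderr <chronyc exited with non-zero exit code 127: '
    '\\nchronyc: error while loading shared libraries: libnettle.so.8: '
    'cannot open shared object file: No such file or directory\\n> stdout <>"'
)

# Matches the dynamic loader's "missing shared library" message.
MISSING_LIB = re.compile(
    r"error while loading shared libraries: (?P<lib>\S+): "
    r"cannot open shared object file"
)


def missing_library(log_line: str) -> Optional[str]:
    """Return the library named in a loader error, or None if there is none."""
    match = MISSING_LIB.search(log_line)
    return match.group("lib") if match else None


print(missing_library(LOG_LINE))
```

Exit code 127 together with this loader message means the binary never started, so the NTP check fails before chronyc contacts chronyd at all.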

I believe the root cause is the change introduced by this commit

The change introduced by that commit bind-mounts the chronyc binary from the underlying OS (the host on which the assisted-installer-agent container runs) into the /usr/bin directory inside the container. In my case, that host OS is Fedora CoreOS 35.20220327.3.0. The problem is that this chronyc is a dynamically linked ELF binary that depends on the libnettle.so.8 shared library, which isn't present in the container. The container does contain libnettle.so.6, though.
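One way to confirm this kind of mismatch is to run ldd against the mounted binary from inside the container and look for "not found" entries. Below is a minimal sketch, assuming ldd is available in the container; the helper names are hypothetical and the sample output is illustrative, not captured from a real system.

```python
import subprocess
from typing import List


def parse_missing(ldd_output: str) -> List[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd prints unresolved dependencies as "<name> => not found".
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def missing_libraries(binary: str) -> List[str]:
    """Run ldd on a binary and list its unresolved shared libraries."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=False
    )
    return parse_missing(result.stdout)


# Illustrative ldd output (an assumption, not taken from the affected host).
SAMPLE = """\
\tlinux-vdso.so.1 (0x00007ffd0a5f2000)
\tlibnettle.so.8 => not found
\tlibc.so.6 => /lib64/libc.so.6 (0x00007f1c2a000000)
"""

print(parse_missing(SAMPLE))
```

Inside the agent container, calling missing_libraries("/usr/bin/chronyc") would be expected to list libnettle.so.8 if the diagnosis above is right.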

Anyway, IMO this [bind-mounting the chronyc command from the underlying OS] is a container anti-pattern.

Wouldn't it be a better approach to use the chronyc installed by the dnf install chrony in the Dockerfile here, which is used to build the assisted installer agent container image?

@tsorya could you have a look at the change introduced in that commit? It introduces a significant prerequisite: the shared libraries that the host's chronyc binary is dynamically linked against must also be present in the assisted installer agent container image. Is there a different approach?

@pdfruth
Author

pdfruth commented Jun 26, 2022

In the meantime, I've been able to work around the error by explicitly setting AGENT_DOCKER_IMAGE: quay.io/edge-infrastructure/assisted-installer-agent:v2.4.1 when customizing the sample okd-configmap.yml file here
Note: v2.4.1 is the last version of the image before the commit that introduced the problem described above.

For example, here is an okd-configmap.yml that works for me today:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config
data:
  ASSISTED_SERVICE_HOST: 192.168.10.2:8090
  ASSISTED_SERVICE_SCHEME: http
  AUTH_TYPE: none
  DB_HOST: 127.0.0.1
  DB_NAME: installer
  DB_PASS: admin
  DB_PORT: "5432"
  DB_USER: admin
  DEPLOY_TARGET: onprem
  DISK_ENCRYPTION_SUPPORT: "false"
  DUMMY_IGNITION: "false"
  ENABLE_SINGLE_NODE_DNSMASQ: "false"
  HW_VALIDATOR_REQUIREMENTS: '[{"version":"default","master":{"cpu_cores":4,"ram_mib":16384,"disk_size_gb":100,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":100,"packet_loss_percentage":0},"worker":{"cpu_cores":2,"ram_mib":8192,"disk_size_gb":100,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":1000,"packet_loss_percentage":10},"sno":{"cpu_cores":8,"ram_mib":16384,"disk_size_gb":100,"installation_disk_speed_threshold_ms":10}}]'
  IMAGE_SERVICE_BASE_URL: http://192.168.10.2:8888
  IPV6_SUPPORT: "true"
  LISTEN_PORT: "8888"
  NTP_DEFAULT_SERVER: ""
  POSTGRESQL_DATABASE: installer
  POSTGRESQL_PASSWORD: admin
  POSTGRESQL_USER: admin
  PUBLIC_CONTAINER_REGISTRIES: 'quay.io'
  SERVICE_BASE_URL: http://192.168.10.2:8090
  STORAGE: filesystem
  OS_IMAGES: '[{"openshift_version":"4.10","cpu_architecture":"x86_64","url":"https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/35.20220327.3.0/x86_64/fedora-coreos-35.20220327.3.0-live.x86_64.iso","rootfs_url":"https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/35.20220327.3.0/x86_64/fedora-coreos-35.20220327.3.0-live-rootfs.x86_64.img","version":"35.20220327.3.0"}]'
  RELEASE_IMAGES: '[{"openshift_version":"4.10","cpu_architecture":"x86_64","url":"quay.io/openshift/okd:4.10.0-0.okd-2022-06-10-131327","version":"4.10.0-0.okd-2022-06-10-131327","default":true}]'
  OKD_RPMS_IMAGE: quay.io/vrutkovs/okd-rpms:4.10
  AGENT_DOCKER_IMAGE: quay.io/edge-infrastructure/assisted-installer-agent:v2.4.1

@tsorya
Contributor

tsorya commented Jun 26, 2022 via email

@omertuc
Contributor

omertuc commented Jun 26, 2022

Anyway, IMO this [bind-mounting the chronyc command from the underlying OS] is a container anti-pattern.
Wouldn't it be a better approach to use the chronyc installed by the dnf install chrony in the Dockerfile here, which is used to build the assisted installer agent container image?

It's not so simple. chronyc inside the agent container communicates, through a mounted Unix domain socket, with the host operating system's non-containerized chronyd daemon, so we would just be moving the problem from "host<->container shared library incompatibilities" to "chronyc<->chronyd socket API incompatibilities across versions". Sadly, the former affects OKD users, while the latter affects (or at least used to affect; maybe recent RHCOS versions have solved it) upstream OCP Assisted Installer agent users. I don't think there is a "right" answer between those two options: both are bound to break (and have in the past). We chose to solve the latter because of a user complaint a while ago, but we did so in a problematic manner (the mount), creating this issue for OKD users.

But we can do something else. Ideally, the solution here would be to disable the host's chronyd systemd service and run an equivalent, containerized chronyd service, but that's a big change. We should probably consider it.

@omertuc
Contributor

omertuc commented Jun 26, 2022

Temporarily, as a workaround, we can solve it by skipping the bind mount when running on top of FCOS.
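A minimal sketch of what such a check could look like, assuming FCOS identifies itself in /etc/os-release with ID=fedora and VARIANT_ID=coreos. The function names and sample strings are hypothetical, and this is not the agent's actual code.

```python
from typing import Dict


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse os-release style KEY=value lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def should_bind_mount_chronyc(os_release_text: str) -> bool:
    """Hypothetical policy: skip the host chronyc bind mount on Fedora CoreOS."""
    fields = parse_os_release(os_release_text)
    # Assumption: FCOS reports ID=fedora together with VARIANT_ID=coreos.
    is_fcos = fields.get("ID") == "fedora" and fields.get("VARIANT_ID") == "coreos"
    return not is_fcos


# Illustrative os-release contents, trimmed to the fields used here.
FCOS_SAMPLE = "ID=fedora\nVERSION_ID=35\nVARIANT_ID=coreos\n"
RHCOS_SAMPLE = 'ID="rhcos"\nVERSION_ID="4.10"\n'

print(should_bind_mount_chronyc(FCOS_SAMPLE))   # False: use the container's own chronyc
print(should_bind_mount_chronyc(RHCOS_SAMPLE))  # True: keep the host bind mount
```

Keying the decision on the host OS keeps the existing behavior for RHCOS while avoiding the shared library mismatch on FCOS, at the cost of reintroducing the chronyc<->chronyd version risk there.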

@omertuc
Contributor

omertuc commented Jun 26, 2022

Created https://issues.redhat.com/browse/MGMT-10937 to track the workaround / solution

@omertuc
Contributor

omertuc commented Jun 26, 2022

cc @vrutkovs

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 25, 2022
@omertuc
Contributor

omertuc commented Sep 26, 2022

/lifecycle frozen

@openshift-ci openshift-ci bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 26, 2022
4 participants